REGULARIZATION IN KERNEL LEARNING

By Shahar Mendelson and Joseph Neeman

The Australian National University and Technion, 
I.I.T. and University of California, Berkeley 

Under mild assumptions on the kernel, we obtain the best known 
error rates in a regularized learning scenario taking place in the cor- 
responding reproducing kernel Hilbert space (RKHS). The main nov- 
elty in the analysis is a proof that one can use a regularization term 
that grows significantly slower than the standard quadratic growth 
in the RKHS norm. 

1. Introduction. Let F be a family of functions from a probability space
$(\Omega, \mu)$ to $\mathbb{R}$. A classical problem of learning theory is the
following: we set $\nu$ to be an (unknown) probability measure on
$\Omega \times \mathbb{R}$ whose marginal distribution on $\Omega$ is $\mu$. Given n
independent samples $(X_1, Y_1), \ldots, (X_n, Y_n) \in \Omega \times \mathbb{R}$,
distributed according to $\nu$, our task is to find a function $\hat f \in F$ such that

(1.1) $\mathbb{E}(\hat f(X_1) - Y_1)^2 - \inf_{f \in F}\mathbb{E}(f(X_1) - Y_1)^2$

is very small. In other words, we want to approximate the distribution $\nu$
by a function from F as closely as possible. Specifically, we want to find
a method of choosing $\hat f$ as a function of the sample $(X_i, Y_i)_{i=1}^{n}$ such that,
with high probability, (1.1) is smaller than a function of n that tends to zero
as n grows. In this paper, we will consider the case where $\Omega$ is a compact
Hausdorff space and $Y_1$ is bounded almost surely.

A widely used approach to solving this problem is to consider a function
$\hat f \in F$ that minimizes the functional

$\frac{1}{n}\sum_{i=1}^{n}(f(X_i) - Y_i)^2$

over all $f \in F$. Such a function is called an empirical minimizer and its
properties have been widely studied (see, e.g., [2, 3, 8, 16, 19] and references 
therein). It turns out that the complexity and geometry of F play a large 
part in determining whether (1.1) is small. Roughly speaking, if F is a small 
family of functions, then (1.1) will be, with high probability, a function of 
n that decreases polynomially fast. 

Of course, there is a disadvantage to having a small family of functions, 
namely, that $\inf_{f \in F}\mathbb{E}(f(X_1) - Y_1)^2$ becomes larger as F becomes smaller.
This trade-off is known as the bias-variance problem. The expression (1.1) is
known as the sample error and $\inf_{f \in F}\mathbb{E}(f(X_1) - Y_1)^2$ is called the
approximation error.

One major issue that needs to be addressed when using the empirical 
minimization algorithm is overfitting. Since all of the information that one 
has is on the behavior of the minimizer on the sample, there is no way to 
distinguish a "simple" minimizer from a more complicated one. The regular- 
ized learning model is a method of solving the bias-variance problem while 
addressing the overfitting problem. We take F to be a very large function
class (so that the approximation error is small) and consider a function $\hat f$
that minimizes the functional

$\frac{1}{n}\sum_{i=1}^{n}(f(X_i) - Y_i)^2 + \gamma_n(f)$,

where $\gamma_n(f)$ measures, in some sense, the "complexity" of the function f
and, for a fixed f, $\gamma_n(f) \to 0$ as $n \to \infty$. Thus, if two functions have the same
empirical behavior, then the algorithm will choose the simpler function of
the two.
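
The selection rule just described can be sketched in a few lines of Python. This is only an illustration: the sample, the two candidate predictors and their complexity values are hypothetical, and the penalty is passed in as a plain number standing in for the complexity term.

```python
def empirical_loss(predictions, targets):
    """Mean squared error (1/n) * sum_i (f(X_i) - Y_i)^2."""
    n = len(targets)
    return sum((p - y) ** 2 for p, y in zip(predictions, targets)) / n


def regularized_objective(predictions, targets, complexity):
    """Empirical squared loss plus a complexity penalty (a stand-in for gamma_n(f))."""
    return empirical_loss(predictions, targets) + complexity


# Hypothetical targets and two candidates with identical empirical behavior.
targets = [0.0, 1.0, 0.5, 0.25]
candidates = {
    "simple": {"predictions": [0.1, 0.9, 0.5, 0.25], "complexity": 0.01},
    "complicated": {"predictions": [0.1, 0.9, 0.5, 0.25], "complexity": 0.20},
}

# The regularized rule picks the candidate with the smaller penalized loss.
chosen = min(
    candidates,
    key=lambda name: regularized_objective(
        candidates[name]["predictions"], targets, candidates[name]["complexity"]
    ),
)
print(chosen)
```

Because the two candidates fit the sample equally well, the penalty alone decides between them, which is exactly the tie-breaking role attributed to the regularization term.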

A common example of the regularized learning problem, and the situ- 
ation we will be considering in this article, is the case where the class of 
functions is a reproducing kernel Hilbert space (RKHS), which is defined 
below and will be denoted throughout this article by H. All of the error 
bounds in this situation (with the exception of one result, discussed later, 
in the classification setting) were restricted to a regularization term of the 
form $\gamma_n(f) = \eta_n\|f\|_H^2$, with the goal being to choose $\eta_n$ so that the error is
as small as possible. As far as we know, it has not even been conjectured
that one could improve the power of $\|f\|_H$ in the regularization process.
Doing just that is the main goal of this article. 

One can motivate the regularized learning model by looking at it as a
collection of empirical minimization problems. Indeed, let $B_H$ be the unit ball
of the space H and consider the empirical minimization problem in $rB_H$ for
some r > 0. As r increases, the approximation error for $rB_H$ decreases and
its sample error increases. We could achieve a small total error by choosing
the correct value of r and performing empirical minimization in $rB_H$. The
role of the regularization term $\gamma_n(f)$ is to force the algorithm to choose the
correct value of r for empirical minimization. We will explain later why this
motivation can be made rigorous and that the regularization problem may
be solved by a solution to a hierarchy of minimization problems.

It should be clear from this motivation that the choice of $\gamma_n$ is critical for
the success of the regularized learning model. There has been some significant
work done recently on finding explicit formulas for $\gamma_n$ that provide low
error rates with high probability. Of particular importance to us, because 
their results are directly comparable to ours, are the works of Caponnetto 
and De Vito [7], Smale and Zhou [31] and Wu, Ying and Zhou [36]. We will 
mention these results in Section 3, in order to compare them to ours. Re- 
cent work on regularization parameters for support vector machines includes 
that of Blanchard, Bousquet and Massart [5]; and Steinwart and Scovel [29]. 
Although there are important differences between our problem and that of 
support vector machines, there are certain similarities, and some tools — 
model selection results and localized complexity parameters, for example — 
are useful for both subjects. 

Our starting point is the realization that analysis based on $L_\infty$-bounds
(even if done in a subtle way) is too loose and is the sole source of a quadratic
regularization term. Using $L_\infty$-bounds is very tempting in our case because
of a convenient fact about reproducing kernel Hilbert spaces: if the kernel
is bounded, then there is a constant $c_K$ such that $\|f\|_\infty \le c_K\|f\|_H$. This
allows one to bound $\|(f - Y)^2\|_\infty \lesssim \|f\|_H^2$, which can be used to control the
"complexity" of the loss class through concentration inequalities (such as
Bernstein's for a single function or Talagrand's for a class of functions) that
depend on the $L_\infty$-norm of functions. What is more significant is that it
allows one to apply contraction inequalities at the cost of a multiplicative
factor, the Lipschitz constant of the loss function on its domain, which
is, for the squared loss, twice the maximal $L_\infty$-norm of a class member.
This approach leads to a regularization term of $\|f\|_H^2$ and it is by avoiding
gratuitous use of $L_\infty$-bounds that we improve that term.
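
The convenient fact quoted above follows from the reproducing property and the Cauchy-Schwarz inequality: $|f(x)| = |\langle f, K(x,\cdot)\rangle_H| \le \sqrt{K(x,x)}\,\|f\|_H$. A minimal numerical sketch, assuming a Gaussian kernel and a hypothetical finite expansion $f = \sum_i a_i K(x_i,\cdot)$ (neither is taken from the paper):

```python
import math


def gaussian_kernel(x, y, width=0.2):
    """Gaussian kernel on the real line; K(x, x) = 1, so sup_x sqrt(K(x, x)) = 1."""
    return math.exp(-((x - y) ** 2) / (2.0 * width ** 2))


def evaluate(centers, coeffs, x):
    """f(x) for the finite expansion f = sum_i coeffs[i] * K(centers[i], .)."""
    return sum(a * gaussian_kernel(c, x) for c, a in zip(centers, coeffs))


def rkhs_norm(centers, coeffs):
    """||f||_H = sqrt(a^T G a), where G is the Gram matrix of the centers."""
    quad = sum(
        ai * aj * gaussian_kernel(ci, cj)
        for ci, ai in zip(centers, coeffs)
        for cj, aj in zip(centers, coeffs)
    )
    return math.sqrt(max(quad, 0.0))


centers = [0.1, 0.4, 0.45, 0.9]   # hypothetical expansion points
coeffs = [1.5, -2.0, 2.5, 0.7]    # hypothetical coefficients
grid = [i / 500.0 for i in range(501)]

# Sup-norm estimated on a grid; it can only underestimate the true supremum.
sup_norm = max(abs(evaluate(centers, coeffs, x)) for x in grid)
h_norm = rkhs_norm(centers, coeffs)
c_K = 1.0  # sup_x sqrt(K(x, x)) for the Gaussian kernel above
print(sup_norm, c_K * h_norm)
```

The same inequality applied to $(f - Y)^2$ is what produces the quadratic dependence on $\|f\|_H$ discussed in the text.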

It should be noted that in the classification setting (to be more precise, in
the example of support vector machines), Blanchard, Bousquet and Massart
[5] showed that a regularization term of the form $\eta_n\|f\|_H$ was possible. Their
approach unfortunately does not extend to the regression case: they still rely
on $L_\infty$-bounds and they obtain a linear regularization term because the loss
function in a support vector machine setup is $\ell(x,y) = \max\{0, 1 - xy\}$ and
$\|\max\{0, 1 - f(X)Y\}\|_\infty$ is linear in $\|f\|_H$ (instead of quadratic, as is the case
for the squared loss). On the other hand, it is conceivable that our technique
could be applied to the classification setting, lowering the exponent of $\|f\|_H$
further still.

The starting point of our analysis is the notion of isomorphic coordinate
projections, introduced in the context of learning theory in [3]. Suppose
that F is a family of functions for which the infimum $\inf_{f \in F}\mathbb{E}(f(X) - Y)^2$
is achieved; call the minimizer $f^*$ and define the excess loss function to be,
for any $f \in F$,

$\mathcal{L}_f^F(X,Y) = (f(X) - Y)^2 - (f^*(X) - Y)^2$.

When the underlying class is clear from the context, we will omit the
superscript F. Denote by P the conditional expectation with respect to the
sample,

$P\mathcal{L}_f = \mathbb{E}(\mathcal{L}_f \mid X_1, Y_1, \ldots, X_n, Y_n)$,

and let $P_n\mathcal{L}_f = \frac{1}{n}\sum_{j=1}^{n}\mathcal{L}_f(X_j, Y_j)$. One can show (see [3] or Theorem 2.2) that
there is some (small) number $\rho_n$ such that, with probability at least $1 - e^{-u}$,
every $f \in F$ satisfies

(1.2) $\frac{1}{2}P_n\mathcal{L}_f - \rho_n \le P\mathcal{L}_f \le 2P_n\mathcal{L}_f + \rho_n$.

We will refer to equations like (1.2) as giving "almost isomorphic coordinate 
projections" because (1.2) tells us that the structures imposed on F by P 
and Pn are, up to a small additive term, isomorphic. This is a useful approach 
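for bounding the error of the empirical minimizer. Indeed, since the empirical
minimizer $\hat f$ satisfies $P_n\mathcal{L}_{\hat f} \le 0$, it is not hard to see that (1.2) implies that

$\mathbb{E}(\hat f(X) - Y)^2 - \inf_{f \in F}\mathbb{E}(f(X) - Y)^2 = P\mathcal{L}_{\hat f} \le \rho_n$.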

It turns out that this isomorphic coordinate projection approach applies 
to regularized learning as well as to empirical minimization. The main result 
in this direction is due to Bartlett [1] and implies that if every ball $rB_H$
satisfies an almost-isomorphic condition, then it is possible to establish a
regularized learning bound. This is an example of a model selection result
because it proves that the regularized learning procedure somehow selects
an appropriate model ($rB_H$ for a good choice of r) from a family of models
(the set of models $\{rB_H : r \ge 1\}$). Of course, model selection results have
been used previously in the study of regularized learning; the use of an 
almost-isomorphic coordinate projection condition, however, first occurred 
in [1] and it is crucial here. Some examples of model selection results for 
problems similar to ours can be found in [5] (Theorem 4.3), [16] and [21]. 

Theorem 1.1 [1]. For each $f \in H$, let $\mathcal{L}_f$ denote the loss of f relative
to the ball $\|f\|_H B_H$,

$\mathcal{L}_f(x,y) = \mathcal{L}_f^{\|f\|_H B_H}(x,y) = (f(x) - y)^2 - (f^*(x) - y)^2$,

where $f^* = \operatorname{argmin}_{\|g\|_H \le \|f\|_H}\mathbb{E}(g(X) - Y)^2$. Under some conditions on $\gamma_n(\cdot)$,
if, for every $f \in H$,

$\frac{1}{2}P_n\mathcal{L}_f - \gamma_n(f) \le P\mathcal{L}_f \le 2P_n\mathcal{L}_f + \gamma_n(f)$,

then the regularized minimizer $\hat f$ satisfies

$\mathbb{E}(\hat f(X) - Y)^2 \le \inf_{f \in H}\left(\mathbb{E}(f(X) - Y)^2 + c\gamma_n(c'f)\right)$,

where c and c' are absolute constants.

Thus, if one could establish sharp "isomorphic coordinate projections"-type
estimates for every excess loss class $\{\mathcal{L}_f : f \in rB_H\}$, then this would
yield regularization bounds.

It is important to emphasize that although at first glance, the problem of 
obtaining isomorphic bounds for kernel classes has been solved in the past 
(based on, e.g., estimates from [2, 22]), this is far from being the case. The 
isomorphic bounds for kernel classes have been studied for the base class
$F = B_H$ (i.e., r = 1), using an $L_\infty$-based argument that includes contraction
inequalities. In contrast, the essential ingredient required for our analysis
(and which determines the regularization parameter) is the way in which
these bounds scale with the radius r. In all of the previous isomorphic results
obtained in the context of kernel classes, the way that the bounds depend on
r was not important and thus never addressed. Moreover, the analysis
used to obtain those results gives a suboptimal estimate as a function of
r: an estimate that scales like $r^2$ because of the $L_\infty$-based method. Indeed,
one factor of r in this quadratic growth follows from a contraction argument
combined with the fact that the maximal $L_\infty$-norm of functions in $rB_H$ is
r. The second factor of r appears because the "complexity" of the class $rB_H$
grows linearly in r.

The consequences of this are clear: since one can identify the regularization
term with the way isomorphic coordinate projection estimates for the
class $rB_H$ scale with r, the regularization term of $\|f\|_H^2$ is an artifact of the
$L_\infty$-based method of analysis that leads to a bound that grows like $r^2$.

Let us mention that if the Lipschitz constant of the loss is bounded by
an absolute constant, as is the case for support vector machines and the
hinge loss, one factor of r can be removed even within the $L_\infty$-based method.
Thus, one can use contraction inequalities freely for that problem and obtain
a linear regularization term; this is the result in [5].

Our analysis will show that the standard regularization bounds, which
grow like $r^2$, where $r = \|f\|_H$, are very pessimistic and may be improved

considerably. Moreover, if we set the regularization term as $\eta_n Z(\|f\|_H)$, we
will establish the best known bounds on $\eta_n$ as well (both results will require
mild assumptions on the kernel).

There are two reasons for the improved bounds. The first is a method that 
allows one to bypass the whole $L_\infty$-based mechanism and this is presented
in Section 4. We shall present a general bound on the empirical process
indexed by the localized excess squared loss class associated with a base
class consisting of linear functionals on $\ell_2$ of norm at most r. This step will
lead to a removal of one factor of r from the $r^2$ term: the one that was due
to an $L_\infty$-based method and a contraction argument.

Second, the ability to employ the "isomorphic" approach allows one to 
use localization techniques. Thus, the effective complexity of the excess loss 
class is caused only by excess loss functions with a relatively small variance; 
by virtue of the geometry of $rB_H$, that set of excess loss functions happens
to come from a rather small subset of $rB_H$. Recall that, intuitively, the
second factor of r comes from the linear growth of the "complexity" of $rB_H$.
However, the actual "isomorphic" estimate for $rB_H$ is determined by the
complexity of the intersection bodies $xB_2 \cap rB_H$, rather than by that of
$rB_H$ [where $B_2$ is the unit ball of $L_2(\mu)$]. It turns out that for a reasonable
RKHS, the complexity of such an intersection body grows at a much slower
rate as a function of r. Indeed, the number of "meaningful directions" in $rB_H$
(when considered as a subset of $L_2$) is small and decreases quickly with r.
Therefore, the true complexity of $rB_H$ will be sublinear in r because, as r
increases, an ever smaller number of directions will actually grow with r and 
influence the complexity. 

Formally, we will show that if the eigenvalues of the integral operator $T_K$
decay like $O(t^{-1/p})$ for some $0 < p < 1$, then one can obtain an isomorphic
bound with $\rho_n$ that scales like

max{^2/{i+p)^^2/p| 

for 9 ~ r^n~^/^logn. This translates to a regularization term of 
where, again, $r = \|f\|_H$.

In this result, one still has a regularization term that grows like $r^2$;
nevertheless, this is a considerable improvement on the $L_\infty$-based result. Because
it decays faster as a function of the sample size n, the $r^2/n$ term seems
superfluous: one would expect it to be dominated by the first term.
Indeed, in Section 5, we will show that it can be removed: under the same
assumption on the decay of the eigenvalues of $T_K$ as above, one may use a
regularization term (up to a logarithmic term) of

$\frac{r^{2p/(1+p)}}{n^{1/(1+p)}}$,

which is the best known dependency on r and n.
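
To get a feel for this improvement, the following sketch evaluates the term displayed above, read as $r^{2p/(1+p)}/n^{1/(1+p)}$ with logarithmic factors dropped, next to a quadratic penalty with the same dependence on n. The quadratic comparator, the value p = 0.5 and the sample size are assumptions made only for illustration.

```python
def improved_term(r, n, p):
    """r^(2p/(1+p)) / n^(1/(1+p)), ignoring logarithmic factors."""
    return r ** (2.0 * p / (1.0 + p)) / n ** (1.0 / (1.0 + p))


def quadratic_yardstick(r, n, p):
    """Hypothetical comparator: quadratic growth in r, same dependence on n."""
    return r ** 2 / n ** (1.0 / (1.0 + p))


def radius_exponent(p):
    """Power of r in the improved term; below 1 whenever 0 < p < 1."""
    return 2.0 * p / (1.0 + p)


p = 0.5       # hypothetical decay parameter, 0 < p < 1
n = 10_000    # hypothetical sample size
table = [
    (r, improved_term(r, n, p), quadratic_yardstick(r, n, p))
    for r in (1.0, 10.0, 100.0)
]
for r, improved, quadratic in table:
    print(f"r = {r:6.1f}   improved = {improved:.3e}   quadratic = {quadratic:.3e}")
```

Since $2p/(1+p) < 1$ for every $0 < p < 1$, the improved term grows sublinearly in r, whereas the comparator grows quadratically.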

We will end this introduction with the formulation and a short discussion 
of our main result. To avoid defining them twice, let us mention that the
space $\ell_{p,\infty}$ and its norm $\|\cdot\|_{p,\infty}$ are included in Definition 3.3.

Assumption. Assume that $\|K(x,x)\|_\infty \le 1$ and that the eigenvalues of
the integral operator $T_K$ satisfy $(\lambda_n)_{n=1}^{\infty} \in \ell_{p,\infty}$ for some $0 < p < 1$. Assume,
further, that there is a constant A such that the eigenfunctions $(\varphi_n)_{n \ge 1}$ of
$T_K$ satisfy $\sup_n\|\varphi_n\|_\infty \le A < \infty$.

Theorem A. Let K be a continuous, symmetric, positive definite kernel
on $\Omega$, a compact Hausdorff space, and set H to be the corresponding
reproducing kernel Hilbert space. If Y is bounded almost surely and the assumption
above is satisfied, then there exist constants $c_1$, $c_2$ and $c_3$ that depend only
on A, p and $\|(\lambda_i)\|_{p,\infty}$; a constant $c_Y$ that depends only on $\|Y\|_\infty$ and a
constant N depending only on $\|Y\|_\infty$ and p for which the following holds.
Let



Let us begin our discussion with the assumptions. The assumption that
$\|K\|_\infty \le 1$ is purely cosmetic: any continuous kernel on a compact space
is bounded and the assumption only prevents unnecessary constants from 
appearing. The assumption on the decay of the eigenvalues is essentially a 
smoothness condition for the kernel; the existence, for example, of a contin- 
uous derivative would be enough. We will discuss the eigenvalue assumption 
in more detail later. For now, let us just say that it has been used before [7] 
in discussing the way in which smoothness of the kernel affects the learning 
rates. 

The assumption that the eigenfunctions $\varphi_n$ are uniformly bounded is
more serious. It has been made before, for example in [35], in which it
was mistakenly claimed that such an assumption holds for all Mercer
kernels. Zhou [37], however, argues against this assumption and provides an
example of a kernel without uniformly bounded eigenfunctions. Let us
remark, therefore, that we do not need the full strength of this assumption.
Indeed, as the proof will reveal, it is enough to have some $0 < \varepsilon < 1/2$ such
that $\sup_n\lambda_n^{\varepsilon}\|\varphi_n\|_\infty$ is bounded. The theorem then remains true if we assume



$V(f,u) = c_3(1 + u + c_Y\ln n + \ln\log(\|f\|_H + e))$.




If n> N and ci log log n <u < C2(logn)^/(^ p\ then, with probability at least 
$1 - \exp(-u/2)$, every minimizer $\hat f$ of

Pnif + f^lV{f,u) 



satisfies 
that $(\lambda_n^{1-2\varepsilon}) \in \ell_{p,\infty}$ instead of $(\lambda_n) \in \ell_{p,\infty}$. Note that $\sup_n\sqrt{\lambda_n}\|\varphi_n\|_\infty < \infty$;
for our assumption to hold, we need to be able to take a power of $\lambda_n$ that
is strictly smaller than 1/2. This is a considerably weaker assumption than
that of uniformly bounded eigenfunctions. For instance, the example given
in [37] of a $C^\infty$ kernel without uniformly bounded eigenfunctions satisfies
our weaker condition for any $\varepsilon > 0$: the eigenvalues decrease exponentially
faster than the $L_\infty$-norms of the eigenfunctions.

For an example of a kernel satisfying our assumption, let k be an even 
function of period 1 and set $K(x,y) = k(x - y)$. If $\mu$ is the Lebesgue measure
on [0,1], then it is easily seen, via a cosine expansion of k, that the eigen- 
functions of K are sine and cosine functions and hence bounded uniformly. 
The periodic Gaussian kernel is an example of such a kernel. 
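
The claim about sine and cosine eigenfunctions can be checked numerically on a uniform grid, where the integral operator becomes a circulant matrix. The sketch below is only an illustration: the profile k is a hypothetical even, 1-periodic function chosen for convenience, and the grid size and frequency are arbitrary.

```python
import math


def k(t):
    """Hypothetical even, 1-periodic profile; the kernel is K(x, y) = k(x - y)."""
    return math.exp(math.cos(2.0 * math.pi * t))


def apply_operator(values, m):
    """Riemann-sum version of (T phi)(x) = int_0^1 k(x - y) phi(y) dy on m grid points."""
    return [
        sum(k((a - b) / m) * values[b] for b in range(m)) / m
        for a in range(m)
    ]


m = 64   # grid size (assumption)
j = 3    # frequency of the test function cos(2*pi*j*x)
phi = [math.cos(2.0 * math.pi * j * a / m) for a in range(m)]
t_phi = apply_operator(phi, m)

# phi equals 1 at the grid point x = 0, so the eigenvalue can be read off there.
eigenvalue = t_phi[0]
# If phi is an eigenvector, T phi - eigenvalue * phi vanishes at every grid point.
residual = max(abs(t_phi[a] - eigenvalue * phi[a]) for a in range(m))
print(eigenvalue, residual)
```

The residual vanishes up to rounding because the sine part of the discrete convolution cancels when k is even, which is the discrete counterpart of the cosine-expansion argument in the text.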

As a final remark on the assumptions, let us point out that one can
trivially construct examples of kernels that satisfy them: just take $(\varphi_n)$ to be a
suitably smooth orthonormal basis of $L_2(\mu)$ and choose $(\lambda_n)$ to be a sequence
that decreases sufficiently rapidly. Then $K(x,y) = \sum_{n=1}^{\infty}\lambda_n\varphi_n(x)\varphi_n(y)$
satisfies our assumptions.

Regarding the theorem, there are some aspects of practical interest that 
we do not address. First, there are constants in the theorem that we have 
made no attempt to compute. Furthermore, these constants depend on
quantities that may not be known (e.g., $\|Y\|_\infty$). One might hope, however, to use
applied statistical techniques (cross-validation, for example) to find plausible
values for the constants. In that case, one should note that the constant
$c_Y$ in the definition of V can be moved to the front of the definition without
changing the validity of the theorem (see Remark 2.6); that way, the applied 
statistician has only one unknown constant to worry about. 

We conclude this introduction with a brief discussion of the error rate of 
Theorem A; more detailed discussions follow at the ends of Sections 3 and 5. 
The formulation of Theorem A is attractive because it shows that we find the 
almost-minimizer (in some sense) regardless of how well our hypothesis class 
approximates the regression function $\mathbb{E}(Y|X)$. To be concrete, however, we
can make an assumption about how the approximation error behaves and
derive explicit error bounds as a function of n. The assumption made in [9]
(and elsewhere) is that there exists some $0 < \sigma \le 1/2$ such that $\mathbb{E}(Y|X)$ is
in the range of $T_K^{\sigma}$ on $L_2(\mu)$. For $\sigma = 1/2$, this implies that $\mathbb{E}(Y|X) \in H$; for
smaller $\sigma$, it somehow says that $\mathbb{E}(Y|X)$ can be approximated reasonably
well by elements in H. Under this assumption, we obtain an error rate of
(ignoring logarithmic factors and the confidence term, u) n~'^'^ / As 
stated above, a detailed discussion follows in later sections; for now, we will 
just mention that the above rate is significantly faster than the rate of •nT'^l'^ 
that was obtained in [31]. 

Regarding the optimality of this error rate, we have very little to say. 
Minimax lower bounds on the error rate are given in [7], but only when the 
regression function $\mathbb{E}(Y|X)$ belongs to H (and their proof does not easily
extend to the more general case considered here). In a very specific case
(when $\sigma = 1/2$ and one cannot take $\sigma > 1/2$), our rates match those in [7].
We can claim, therefore, that our results are optimal in a very specific sense;
in the more interesting region $0 < \sigma < 1/2$, however, we cannot make any
such claim. 

2. Preliminaries. We begin with a word about notation. We will denote
absolute constants (i.e., fixed, positive numbers) by $c, c_1, \ldots$, etc. Their
values may change from line to line. Absolute constants whose values will
remain unchanged are denoted by $\kappa_1, \kappa_2, \ldots$. By c(a), we mean that the
constant c depends only on the parameter a. We write $a \sim b$ if there exist
absolute constants $c_1$ and $c_2$ such that $c_1 a \le b \le c_2 a$, and $a \sim_p b$ if the
equivalence constants depend on the parameter p.

Arguably the most important tool in modern empirical processes theory 
is Talagrand's concentration inequality for an empirical process indexed by 
a class of uniformly bounded functions [18, 33]. The version of this concen- 
tration result which we shall use here is due to Massart [20]. 

Theorem 2.1. There exists an absolute constant C for which the following
holds. Let F be a class of functions defined on $(\Omega, \mu)$ such that for
every $f \in F$, $\|f\|_\infty \le b$ and $\mathbb{E}f = 0$. Let $X_1, \ldots, X_n$ be independent random
variables distributed according to $\mu$ and set $\sigma^2 = n\sup_{f \in F}\mathbb{E}f^2$. Define

$Z = \sup_{f \in F}\sum_{i=1}^{n}f(X_i)$ and $\bar{Z} = \sup_{f \in F}\left|\sum_{i=1}^{n}f(X_i)\right|$.

Then, for every x > 0 and every $\rho > 0$,

$\Pr(\{Z \ge (1 + \rho)\mathbb{E}Z + \sigma\sqrt{Cx} + C(1 + \rho^{-1})bx\}) \le e^{-x}$,

$\Pr(\{Z \le (1 - \rho)\mathbb{E}Z - \sigma\sqrt{Cx} - C(1 + \rho^{-1})bx\}) \le e^{-x}$,

and the same inequalities hold for $\bar{Z}$.



Throughout this article, we denote by $\ell(x,y) = (x - y)^2$ the squared loss
function. When f is a function $\Omega \to \mathbb{R}$ and Y is some target random variable,
we define $\ell_f = \ell_f(X,Y) = (f(X) - Y)^2$. If F is a class of functions, let
$\mathcal{L}_f^F = \mathcal{L}_f^F(X,Y) = (f(X) - Y)^2 - (f^*(X) - Y)^2$, where $f^* = \operatorname{argmin}_{f \in F}\mathbb{E}\ell_f$ (we will
usually drop the superscript F). Of course, we assume that this minimizer
exists and is unique, which is the case, for example, if F is compact (in $L_2$)
and convex. $\mathcal{L}_F$ denotes the class of functions $\{\mathcal{L}_f : f \in F\}$.

For a class of functions F on a probability space $(\Omega, \mu)$, we set

$\|P_n - P\|_F = \sup_{f \in F}\left|\frac{1}{n}\sum_{i=1}^{n}f(X_i) - \mathbb{E}f\right|$,

where $(X_i)_{i=1}^{n}$ are independent, distributed according to $\mu$.
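For any x > 0, define the localized excess loss class

$\mathcal{L}_x = \{\mathcal{L}_f : \mathbb{E}\mathcal{L}_f \le x\}$

and set

$V = \operatorname{star}(\mathcal{L}_F, 0) = \{\theta\mathcal{L}_f : 0 \le \theta \le 1, f \in F\}$,
$V_x = \{\theta\mathcal{L}_f : 0 \le \theta \le 1, \mathbb{E}(\theta\mathcal{L}_f) \le x\} = \{h \in \operatorname{star}(\mathcal{L}_F, 0) : \mathbb{E}h \le x\}$

[where, for a set T, $\operatorname{star}(T, 0) = \{\theta t : 0 \le \theta \le 1, t \in T\}$ is the star-shaped hull
of T and 0].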

The following "isomorphic" result is similar in nature to the one proved 
in [3]. The bound from Theorem 2.2 normally leads to an estimate on the 
error of the empirical minimizer, but in [4] and here, it will serve a different 
purpose. This isomorphic result will enable us to control the solution of the 
regularized learning problem in the context of kernel learning. 



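Theorem 2.2. There exists an absolute constant c for which the following
holds. Let $\mathcal{L}_F$ be a squared loss class associated with a convex class
F and a random variable Y. If $b = \max\{\sup_{f \in F}\|f\|_\infty, \|Y\|_\infty\}$ and x > 0
satisfies

$\mathbb{E}\|P_n - P\|_{V_x} \le x/8$,

then, for every u > 0, with probability at least $1 - \exp(-u)$, for every $f \in F$,

(2.1) $\frac{1}{2}P_n\mathcal{L}_f - \frac{x}{2} - c(1 + b^2)\frac{u}{n} \le P\mathcal{L}_f \le 2P_n\mathcal{L}_f + \frac{x}{2} + c(1 + b^2)\frac{u}{n}$.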

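Proof. By Talagrand's inequality, there exists an absolute constant $C \ge 1$
such that, for every u > 0 and every a > 0, with probability at least $1 - e^{-u}$,

$\|P_n - P\|_{V_a} \le 2\mathbb{E}\|P_n - P\|_{V_a} + \left(\frac{Cu}{n}\right)^{1/2}\sup_{g \in V_a}\sqrt{\operatorname{Var} g} + \frac{Cb^2u}{n}$.

It is standard to verify (see, e.g., [19]) that there exists an absolute constant
C such that, for a convex class F, every $\mathcal{L}_f \in \mathcal{L}_F$ satisfies $\mathbb{E}\mathcal{L}_f^2 \le Cb^2\mathbb{E}\mathcal{L}_f$.
Thus, every $g \in V_a$ satisfies $\operatorname{Var} g \le Cb^2a$. Fix x satisfying
$\mathbb{E}\|P_n - P\|_{V_x} \le x/8$ and set

$a = \max\left\{x, 25C^2(1 + b^2)\frac{u}{n}\right\}$.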

Note that, because V is star shaped, $a \ge x$ implies that $V_a \subset \frac{a}{x}V_x$ and so
$\mathbb{E}\|P_n - P\|_{V_a} \le \frac{a}{x}\mathbb{E}\|P_n - P\|_{V_x} \le a/8$. Therefore, with probability at least
$1 - e^{-u}$,

(2.2) $\|P_n - P\|_{V_a} \le \frac{a}{4} + \left(\frac{C^2b^2au}{n}\right)^{1/2} + \frac{Cb^2u}{n} \le \frac{a}{4} + \frac{a}{5} + \frac{a}{25} \le \frac{a}{2}$.

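Consider the event in which (2.2) holds. Fix some $\mathcal{L}_f \in \mathcal{L}_F$. If $P\mathcal{L}_f \le a$,
then $\mathcal{L}_f \in V_a$ and so

$P_n\mathcal{L}_f - \frac{a}{2} \le P\mathcal{L}_f \le P_n\mathcal{L}_f + \frac{a}{2}$,

and (2.1) holds. If, on the other hand, $P\mathcal{L}_f = \beta > a$, then let $g = \frac{a}{\beta}\mathcal{L}_f$ and
note that $g \in V_a$ and $Pg = a$. Thus, by (2.2),

$\frac{1}{2}Pg = Pg - \frac{a}{2} \le P_n g \le Pg + \frac{a}{2} \le 2Pg$.

Since $\mathcal{L}_f$ is a constant multiple of g, we have

$\frac{1}{2}P\mathcal{L}_f \le P_n\mathcal{L}_f \le 2P\mathcal{L}_f$

and so (2.1) holds once again.

To conclude, (2.2) implies that (2.1) holds for all $\mathcal{L}_f \in \mathcal{L}_F$. Thus, (2.1)
holds with probability at least $1 - e^{-u}$. □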

Remark 2.3. The claim of Theorem 2.2 holds under milder assumptions.
Note that the assumption that F is convex is there to ensure that
$P\ell_f$ attains a unique minimum in F and that the excess loss class satisfies a
Bernstein-type condition: that for every $f \in F$, $\mathbb{E}\mathcal{L}_f^2 \le C\mathbb{E}\mathcal{L}_f$. One can show
that if F is convex, then, for any function $f \in F$, $\mathbb{E}\mathcal{L}_f^2 \le c\|f\|_\infty^2\mathbb{E}\mathcal{L}_f$. Hence,
if F is convex and G is a subset of F that contains the minimizer in F of
$P\ell_f$, then the analog of Theorem 2.2 will be true for $\{\mathcal{L}_g : g \in G\}$.

The first part of our analysis will be to show that this isomorphic infor- 
mation can be used to derive estimates in regularized learning. 

2.1. From isomorphic information to regularized learning. The regular- 
ized learning model provides a method for learning in a very large class of 
functions without suffering a large statistical error. As we mentioned in the 
Introduction, obtaining an "isomorphic" result for a hierarchy of classes can 
lead to estimates in the regularized learning model. This approach was in- 
troduced in [1] and was formulated in the way we will use here in [4]. Since 
this last article has not yet appeared, we present a proof of the result we
need in the Appendix. 

Let F be a class of functions and suppose that there is a collection of
subsets $\{F_r : r \ge 1\}$ with the following properties:

1. $\{F_r : r \ge 1\}$ is monotone (i.e., whenever $r \le s$, $F_r \subset F_s$);

2. for every $r \ge 1$, there exists a unique element $f_r^* \in F_r$ such that
$P\ell_{f_r^*} = \inf_{f \in F_r}P\ell_f$;

3. the map $r \mapsto P\ell_{f_r^*}$ is continuous;

4. for every $r_0 \ge 1$, $\bigcap_{r > r_0}F_r = F_{r_0}$;

5. $\bigcup_{r \ge 1}F_r = F$.

Definition 2.4. Given a class of functions F, we say that $\{F_r : r \ge 1\}$
is an ordered, parameterized hierarchy of F if the above conditions 1-5 are
satisfied. Define, for $f \in F$,

$r(f) = \inf\{r \ge 1 : f \in F_r\}$.

Note that, from the semicontinuity property of an ordered, parameterized
hierarchy (property 4), it follows that $f \in F_{r(f)}$ for all $f \in F$.

From the second property of an ordered, parameterized hierarchy, we can,
for $r \ge 1$ and $f \in F_r$, define $\mathcal{L}_{r,f} = (f - Y)^2 - (f_r^* - Y)^2$. That is, $\mathcal{L}_{r,f}$ is the
excess loss function with respect to the class $F_r$.

Theorem 2.5. There exist absolute constants $\kappa_1$ and $\kappa_2$ such that the
following holds. Suppose that $\{F_r : r \ge 1\}$ is an ordered, parameterized
hierarchy and that $\rho_n(r,u) : [1,\infty) \times (0,\infty) \to (0,\infty)$ is a continuous function
(possibly depending on the sample) that is increasing in both r and u.
Suppose, also, that for every $r \ge 1$ and every u > 0, with probability at least
$1 - \exp(-u)$,

$\frac{1}{2}P_n\mathcal{L}_{r,f} - \rho_n(r,u) \le P\mathcal{L}_{r,f} \le 2P_n\mathcal{L}_{r,f} + \rho_n(r,u)$

for all $f \in F_r$.

Then, for every u > 0, with probability at least $1 - \exp(-u)$, any function
$\hat f \in F$ that minimizes the functional

$P_n\ell_f + \kappa_1\rho_n(2r(f), \theta(r(f),u))$

also satisfies

$P\ell_{\hat f} \le \inf_{f \in F}\left(P\ell_f + \kappa_2\rho_n(2r(f), \theta(r(f),u))\right)$,

where

7r2 / Pir \ 

6 V /0„(l,2; + log(7rV6)) J 
Remark 2.6. In fact, the proof of Theorem 2.5 reveals something slightly
stronger: if $\rho_n'(r,u)$ is a continuous, increasing function in both variables such
that

$\rho_n'(r,u) \ge \rho_n(2r, \theta(r,u))$

for every r, u and n, then every function $\hat f$ that minimizes the functional

$P_n\ell_f + \kappa_1\rho_n'(r(f),u)$

satisfies

$P\ell_{\hat f} \le \inf_{f \in F}\left(P\ell_f + \kappa_2\rho_n'(r(f),u)\right)$.

In other words, we can always regularize with a larger regularization term;
we will obtain a correspondingly larger error bound. We will use this fact
later.

The conclusion of Theorem 2.5 can be reformulated in a way that makes 
the traditional distinction between the approximation and sample errors 
more explicit. We begin by defining an approximation error term by 

$A(r) = \inf_{f \in F_r}P\ell_f$.

Then $A(r) - \inf_{f \in F}P\ell_f$ tends to zero as $r \to \infty$ and the rate of this
convergence measures how well the ordered, parameterized hierarchy approximates
Y. Smale and Zhou [30] study this approximation error in a variety of
contexts, including the case in which we are interested: when $F_r$ is the ball of
radius $r - 1$ in a reproducing kernel Hilbert space.

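Corollary 2.7. Under the assumptions of Theorem 2.5, with probability
at least $1 - \exp(-u)$,

$P\ell_{\hat f} \le \inf_{r \ge 1}\left(A(r) + \kappa_2\rho_n(2r, \theta(r,u))\right)$.

Proof. Let u > 0, fix $\varepsilon > 0$ and choose an $s \ge 1$ such that

$A(s) + \kappa_2\rho_n(2s, \theta(s,u)) \le \inf_{r \ge 1}\left(A(r) + \kappa_2\rho_n(2r, \theta(r,u))\right) + \frac{\varepsilon}{2}$.

Consider $g \in F_s$ such that $P\ell_g \le A(s) + \varepsilon/2$. Since $r(g) \le s$ and $\rho_n$ is increasing
in both of its arguments, we have

$P\ell_g + \kappa_2\rho_n(2r(g), \theta(r(g),u)) \le \inf_{r \ge 1}\left(A(r) + \kappa_2\rho_n(2r, \theta(r,u))\right) + \varepsilon$.

However, we can find such a function g for every $\varepsilon > 0$. Therefore,

$\inf_{f \in F}\left(P\ell_f + \kappa_2\rho_n(2r(f), \theta(r(f),u))\right) \le \inf_{r \ge 1}\left(A(r) + \kappa_2\rho_n(2r, \theta(r,u))\right)$

and the conclusion follows from Theorem 2.5. □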
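
The bound in Corollary 2.7 is a trade-off between a decreasing approximation term and an increasing sample term. The sketch below minimizes such a sum over a grid of radii. Both functions are hypothetical stand-ins chosen so that the minimizer is known in closed form; they are not the quantities computed in the paper.

```python
def tradeoff(approx_error, sample_term, radii):
    """Return (best_r, best_value) minimizing approx_error(r) + sample_term(r) over radii."""
    best_r = min(radii, key=lambda r: approx_error(r) + sample_term(r))
    return best_r, approx_error(best_r) + sample_term(best_r)


n = 1000  # hypothetical sample size


def approx_error(r):
    """Hypothetical approximation error, decreasing to zero as r grows."""
    return 1.0 / r


def sample_term(r):
    """Hypothetical stand-in for the sample-error term, increasing in r."""
    return r ** 2 / n


radii = [1.0 + 0.01 * step for step in range(5000)]  # grid on [1, 51)
best_r, best_value = tradeoff(approx_error, sample_term, radii)
print(best_r, best_value)
```

For these stand-ins the continuous minimizer solves $r^3 = n/2$, so the grid search should land near $r \approx 7.94$ when n = 1000.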
3. Regularization in kernel classes. The case that we will be interested
in is when Fr is a multiple of the unit ball of an RKHS. For more details on 
properties of an RKHS that are relevant in the context of learning theory, 
we refer the reader to, for example, [8]. 

Let $\Omega$ be a compact Hausdorff space, consider $K : \Omega \times \Omega \to \mathbb{R}$, a positive
definite, continuous function and, without loss of generality, assume that
$\|K\|_\infty \le 1$. Let $T_K$ be the corresponding integral operator, $T_K : L_2(\mu) \to L_2(\mu)$,
defined by

$(T_K f)(x) = \int K(x,y)f(y)\,d\mu(y)$.

By Mercer's theorem [8], there is an orthonormal basis of eigenfunctions
$(\varphi_i)_{i=1}^{\infty}$ of $T_K$, corresponding to the eigenvalues $(\lambda_i)_{i=1}^{\infty}$ arranged in a
nonincreasing order, such that

$K(x,y) = \sum_{i=1}^{\infty}\lambda_i\varphi_i(x)\varphi_i(y)$,

where the convergence is uniform and absolute on the support of $\mu \times \mu$ [10]
(and hence there is also convergence in $L_2$).

The RKHS, which will be denoted throughout by H, can be identified with
linear functionals in $\ell_2$. Indeed, consider the function $\Phi : \Omega \to \ell_2$ defined by
$\Phi(x) = (\sqrt{\lambda_i}\varphi_i(x))_{i=1}^{\infty}$. For every $t \in \ell_2$, define the corresponding element of
H by $f_t(x) = \langle\Phi(x), t\rangle$; we define the RKHS H to be the image of $\ell_2$ under the
map $t \mapsto f_t$ with the induced inner product $\langle f_t, f_s\rangle_H = \langle t, s\rangle$. This definition
of H is phrased differently from those given in [8, 10], but it is easily checked
that the resulting Hilbert space of functions is the same. Hence, to study
properties of a subset of H, it is enough to study the corresponding set
of linear functionals, as a set $T \subset \ell_2$ uniquely determines $F_T = \{f_t : t \in T\}$.
Here, we will mostly be concerned with $T = rB_2$, corresponding to $F_T = rB_H$,
where $B_H$ is the unit ball of the RKHS and $B_2$ is the unit ball of $\ell_2$. In this
case, the measure endowed on $\ell_2$ is given by $\Phi(Z)$, where Z is distributed
in $\Omega$ according to $\mu$.
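
The identification of H with linear functionals on $\ell_2$ can be made concrete with a truncated feature map. The sketch below is an illustration under stated assumptions: the cosine basis on [0, 1], the eigenvalues $0.3/i^2$, the truncation level and the vector t are all hypothetical choices, not objects taken from the paper.

```python
import math


def phi(i, x):
    """Hypothetical uniformly bounded orthonormal basis of L2[0, 1] (cosines), i = 1, 2, ..."""
    return 1.0 if i == 1 else math.sqrt(2.0) * math.cos(math.pi * (i - 1) * x)


def lam(i):
    """Hypothetical eigenvalues, nonincreasing in i and summable."""
    return 0.3 / i ** 2


def feature_map(x, dim=100):
    """Truncation of Phi(x) = (sqrt(lambda_i) * phi_i(x))_i to the first dim coordinates."""
    return [math.sqrt(lam(i)) * phi(i, x) for i in range(1, dim + 1)]


def dot(u, v):
    """Euclidean inner product of two equally long coordinate lists."""
    return sum(a * b for a, b in zip(u, v))


def f_t(t, x):
    """The function attached to t: f_t(x) = <Phi(x), t>."""
    return dot(feature_map(x, dim=len(t)), t)


def kernel(x, y, dim=100):
    """K(x, y) recovered as <Phi(x), Phi(y)>, i.e. a truncated Mercer series."""
    return dot(feature_map(x, dim), feature_map(y, dim))


t = [1.0 / i for i in range(1, 101)]   # a hypothetical element of l2 (truncated)
t_norm = math.sqrt(dot(t, t))          # plays the role of ||f_t||_H under the identification
print(f_t(t, 0.25), t_norm, kernel(0.25, 0.25))
```

Under this identification, the Cauchy-Schwarz inequality in $\ell_2$ gives $|f_t(x)| \le \|t\|\sqrt{K(x,x)}$, which is the sup-norm bound used by the $L_\infty$ approach.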

3.1. Classes of linear functionals: The $L_\infty$ approach. Our first approach
to the problem of regularized learning in an RKHS will lead to a regularization
term of $\|f\|_H^2$. As stated in the Introduction, this is over-regularization,
which is an artifact of the analysis of the learning problem. It stems from
the way that the $L_\infty$-bound on functions in H is used and the fact that the
only way to bound $\|\mathcal{L}_f\|_\infty$ is by $\|\mathcal{L}_f\|_\infty \le c\|f\|_H^2$. In this section, we will
use this (loose) approach, but still obtain better error estimates than those
previously known, although still using a regularization term of $\|f\|_H^2$. We
will obtain considerably better results in the following sections.




The idea we will use is to obtain an isomorphic result for the hierarchy $F_r = rB_H$ (in our $\ell_2$ representation, $F_r$ corresponds to $rB_2$). We then use Corollary 2.7 for the function $\rho_n$ given by the isomorphic analysis.

In our presentation, we will study the following, more general, situation. Let $T \subset \ell_2$ be a compact, convex, symmetric set and consider a random vector $X$ on $\ell_2$ [distributed, recall, according to $\Phi(Z)$]. Denote by $f_t = \langle t, \cdot \rangle$ the linear functional defined by $t$ and put

$$D = \{t : \mathbb{E} f_t(X)^2 \le 1\} = \{t : \mathbb{E}\langle t, X \rangle^2 \le 1\}.$$

Thus, $D$ is the image of the $L_2$ unit ball in the parameter space $\ell_2$.

Our first, $L_\infty$-based, approach to the problem of learning in an RKHS relies on the following bound, which was implicit in [22] (to be precise, Theorem 3.1 follows from the proof of Theorem 3.3 in [22] if one keeps track explicitly of the constants $C_1$ and $C_2$).



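Theorem 3.1. There exist constants $c$ and $c'$ depending only on $\|Y\|_\infty$ for which the following holds. Let $V_{r,x} = \{\alpha\mathcal{L}_f : 0 \le \alpha \le 1, f \in rB_H, \mathbb{E}\mathcal{L}_f \le x\}$. Then for every $r \ge 1$ and every $x > 0$,

$$\mathbb{E}\|P - P_n\|_{V_{r,x}} \le c\,r\,\mathbb{E}\sup_{t \in rB_2 \cap \sqrt{x}D} \left| \frac{1}{n}\sum_{i=1}^n g_i f_t(X_i) \right|,$$

where the $g_i$ are independent standard Gaussian variables. In the case where $r = 1$, we have

$$\mathbb{E}\sup_{t \in B_2 \cap \sqrt{x}D} \left| \frac{1}{n}\sum_{i=1}^n g_i f_t(X_i) \right| \le c'\left( \frac{1}{n}\sum_{i=1}^\infty \min\{x, \lambda_i\} \right)^{1/2}.$$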

The proof of the first part of Theorem 3.1 uses a comparison theorem, relating the Gaussian process $t \mapsto \sum_{i=1}^n g_i \mathcal{L}_{f_t}(X_i, Y_i)$, conditioned on $(X_i, Y_i)_{i=1}^n$, to the conditioned Gaussian process $t \mapsto \sum_{i=1}^n g_i f_t(X_i)$. This is done using an $L_\infty$-bound since

$$\sum_{i=1}^n (\mathcal{L}_{f_t} - \mathcal{L}_{f_s})^2(X_i, Y_i) = \sum_{i=1}^n (f_t - f_s)^2(X_i) \cdot ((f_t + f_s)(X_i) - 2Y_i)^2 \le 4(r + \|Y\|_\infty)^2 \sum_{i=1}^n (f_t - f_s)^2(X_i),$$

which will turn out to be the main source of the quadratic regularization term $\|f\|_H^2$.
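
The pointwise inequality behind this comparison can be checked numerically. The sketch below is illustrative only: the radius, the bound on $|Y|$, the sample size and the uniformly drawn function values are assumptions, chosen so that $|f_t(X_i)|, |f_s(X_i)| \le r$ and $|Y_i| \le \|Y\|_\infty$, as in the setting above.

```python
import random

def excess_gap_sq(ft, fs, y):
    """(L_{f_t} - L_{f_s})^2 at one sample; the common term l_{f*} cancels in the difference."""
    return ((ft - y) ** 2 - (fs - y) ** 2) ** 2

random.seed(0)
r, y_bound, n = 3.0, 1.0, 1000   # hypothetical radius, bound on |Y|, sample size
ft = [random.uniform(-r, r) for _ in range(n)]
fs = [random.uniform(-r, r) for _ in range(n)]
ys = [random.uniform(-y_bound, y_bound) for _ in range(n)]

lhs = sum(excess_gap_sq(a, b, y) for a, b, y in zip(ft, fs, ys))
rhs = 4 * (r + y_bound) ** 2 * sum((a - b) ** 2 for a, b in zip(ft, fs))
print(lhs <= rhs)
```

The factor $4(r + \|Y\|_\infty)^2$ on the right is where the quadratic dependence on $r$ enters.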

From Theorem 3.1, one obtains the following. 
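Corollary 3.2. There exists a constant $c$, depending only on $\|Y\|_\infty$, such that if $z > 0$ satisfies

$$z \ge c\left( \frac{1}{n}\sum_{i=1}^\infty \min\{z, \lambda_i\} \right)^{1/2},$$

then, for all $r \ge 1$,

$$\frac{x}{8} \ge \mathbb{E}\|P - P_n\|_{V_{r,x}},$$

where $x = r^2 z$.

Proof. Define

$$\psi_r(x) = r\,\mathbb{E}\sup_{t \in rB_2 \cap \sqrt{x}D} \left| \frac{1}{n}\sum_{i=1}^n g_i f_t(X_i) \right|.$$

By the second part of Theorem 3.1, we can choose the constant in the condition on $z$ such that $\psi_1(z) \le z/(8c)$ (where $c$ is the constant from Theorem 3.1). Furthermore, it is easily checked that $\psi_r(x) = r^2\psi_1(x r^{-2})$ for any $x$ and $r$. That is, $\psi_r(r^2 z) = r^2\psi_1(z) \le r^2 z/(8c) = x/(8c)$. The claim now follows from the first part of Theorem 3.1. □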



With this corollary and Theorem 2.2, we can obtain an isomorphic con- 
dition on the unit ball of an RKHS using information on the decay of the 
eigenvalues. For the sake of concreteness, we will make the following assump- 
tion on this rate of decay; this assumption will allow us to compute an error 
bound explicitly. 



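Definition 3.3. For $0 < p < 1$ and a nonincreasing, nonnegative sequence $(\lambda_i)_{i=1}^\infty$, define

$$\|(\lambda_i)\|_{p,\infty} = \sup_{i \ge 1} i^{1/p}\lambda_i.$$

Hence, for any $x > 0$,

(3.1) $$|\{i : \lambda_i \ge x\}| \le \|(\lambda_i)\|_{p,\infty}^p\, x^{-p}.$$

If $\|(\lambda_i)\|_{p,\infty} < \infty$, we will say that $(\lambda_i) \in \ell_{p,\infty}$.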
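
As a quick illustration of the weak-$\ell_p$ quasi-norm and of the counting bound, the following Python sketch evaluates both on a finite sequence. The choice $\lambda_i = i^{-1/p}$ with $p = 1/2$, the truncation at 2000 terms and the function names are assumptions made for the example.

```python
def weak_lp_norm(lams, p):
    """sup_i i^(1/p) * lam_i over a finite, nonincreasing sequence (i starts at 1)."""
    return max((i ** (1.0 / p)) * lam for i, lam in enumerate(lams, start=1))

def count_at_least(lams, x):
    """Number of indices i with lam_i >= x."""
    return sum(1 for lam in lams if lam >= x)

p = 0.5
lams = [i ** (-1.0 / p) for i in range(1, 2001)]   # lam_i = i^-2
norm = weak_lp_norm(lams, p)

# Counting bound: |{i : lam_i >= x}| <= norm^p * x^(-p), up to floating-point slack.
checks = [count_at_least(lams, x) <= norm ** p * x ** (-p) + 1e-9
          for x in (1e-1, 1e-2, 1e-3, 1e-4)]
print(all(checks))
```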



Assumption 3.1. Let $K$ be a kernel on a compact probability space $(\Omega \times \Omega, \mu \times \mu)$ where $\mu$ is a Borel measure and $\Omega \subset \mathbb{R}^d$. Assume that $\|K(x,y)\|_\infty \le 1$ and that the eigenvalues of the integral operator $T_K$ satisfy $(\lambda_n)_{n=1}^\infty \in \ell_{p,\infty}$ for some $0 < p < 1$.
Since $\int K(x,x)\,d\mu(x) = \sum_{i=1}^\infty \lambda_i$, we have $(\lambda_i) \in \ell_{1,\infty}$ when $K(x,x) \in L_1(\mu)$.

The stronger Assumption 3.1 is satisfied under some smoothness condition on the kernel. Suppose, for example, that the kernel $K$ belongs to some Besov space $B^\alpha_{2,\infty}$ [in particular, this is the case if $\alpha \in \mathbb{N}$ and $K \in C^\alpha(\Omega \times \Omega)$]. If $\Omega \subset \mathbb{R}^d$ is locally the graph of a Lipschitz function and $\mu$ is a Borel (probability) measure on $\Omega$, then, by Theorems 4.1 and 4.7 of [6] (see also [17]), the sequence $(\lambda_i)$ belongs to $\ell_{p,\infty}$ for

$$p = \frac{1}{\alpha/d + 1/2}.$$

A similar assumption on the decay of the eigenvalues was made in [7]. The $L_\infty$ assumption on $K(x,x)$ is only to simplify the presentation and any uniform bound instead of 1 would do.

The assumption on the rate of decay of the eigenvalues allows us to obtain 
the following bound. 

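Lemma 3.4. For $0 < p < 1$, there is a constant $c_p$ depending only on $p$ such that for all $x > 0$ and all $r > 0$,

$$\sum_{i=1}^\infty \min\{x, r^2\lambda_i\} \le c_p\|(\lambda_i)\|_{p,\infty}^p\, x^{1-p} r^{2p}.$$

Proof. It suffices to prove the lemma for $r = 1$ and the result will follow for all $r$ by homogeneity. Set $N_x = |\{i : \lambda_i > x\}|$ and observe that for all $x > 0$,

$$\sum_{i=1}^\infty \min\{x, \lambda_i\} = xN_x + \sum_{i=N_x+1}^\infty \lambda_i \le \|(\lambda_i)\|_{p,\infty}^p\, x^{1-p} + \sum_{i=N_x+1}^\infty \lambda_i.$$

The first term is in the required form. Let us deal with the second term. Put $M_x = \|(\lambda_i)\|_{p,\infty}^p x^{-p}$, so that $N_x \le M_x$. Every $\lambda_i$ with $i > N_x$ is at most $x$, and $\lambda_i \le \|(\lambda_i)\|_{p,\infty} i^{-1/p}$ for every $i$. Hence,

$$\sum_{i=N_x+1}^\infty \lambda_i \le xM_x + \|(\lambda_i)\|_{p,\infty}\sum_{i > M_x} i^{-1/p} \le xM_x + c_p\|(\lambda_i)\|_{p,\infty} M_x^{1-1/p} \le c_p\|(\lambda_i)\|_{p,\infty}^p\, x^{1-p},$$

as required. □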
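
The bound of Lemma 3.4 is easy to probe numerically. The sketch below is a sanity check under explicit assumptions, not part of the proof: it takes $\lambda_i = i^{-1/p}$ (so that $\|(\lambda_i)\|_{p,\infty} = 1$), bounds the infinite sum by a partial sum plus an integral estimate of the tail, and uses the illustrative constant $c_p = 2 + p/(1-p)$, which is not claimed to be optimal.

```python
def lemma_sum_upper(x, r, p, terms=20_000):
    """Upper estimate of sum_{i>=1} min{x, r^2 i^(-1/p)}: partial sum plus a tail bound."""
    partial = sum(min(x, r * r * i ** (-1.0 / p)) for i in range(1, terms + 1))
    # Tail: sum_{i > terms} r^2 i^(-1/p) <= r^2 * integral of t^(-1/p) over [terms, inf).
    tail = r * r * terms ** (1.0 - 1.0 / p) / (1.0 / p - 1.0)
    return partial + tail

def lemma_bound(x, r, p):
    """c_p * x^(1-p) * r^(2p) with the assumed constant c_p = 2 + p / (1 - p)."""
    return (2.0 + p / (1.0 - p)) * x ** (1.0 - p) * r ** (2.0 * p)

p = 0.5
grid = [(x, r) for x in (1e-4, 1e-2, 1.0) for r in (1.0, 3.0, 10.0)]
ok = all(lemma_sum_upper(x, r, p) <= lemma_bound(x, r, p) for x, r in grid)
print(ok)
```

For $p = 1/2$ the right-hand side grows only linearly in $r$, whereas the trivial bound $r^2\sum_i \lambda_i$ grows quadratically.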

With the preceding bound on $\sum_i \min\{x, r^2\lambda_i\}$, we can rewrite Corollary 3.2 in a nicer form that is specialized to our application; recall that $V_{r,x}$ is the localization at level $x$ of the star-shaped hull of the shifted loss class of $rB_H$: $V_{r,x} = \{\alpha\mathcal{L}_f : 0 \le \alpha \le 1, f \in rB_H, \mathbb{E}\mathcal{L}_f \le x\}$.
Corollary 3.5. Let $K$ be a kernel that satisfies Assumption 3.1 for some $0 < p < 1$. There exists a constant $c_p$ depending only on $p$ such that if $z = c_p(\|(\lambda_i)\|_{p,\infty}^p / n)^{1/(1+p)}$, then, for all $r \ge 1$,

$$\frac{x}{8} \ge \mathbb{E}\|P - P_n\|_{V_{r,x}},$$

where $x = r^2 z$.

Having controlled the quantity $\mathbb{E}\|P - P_n\|_{V_{r,x}}$ that can give us the "isomorphic coordinate projection" result we wanted, we are almost in a position to prove our first result; it only remains to show that we can apply the model selection result.



Lemma 3.6. Let $H$ be the reproducing kernel Hilbert space associated with a continuous, symmetric, positive definite kernel $K$. Set $F = H$ and define, for every $r \ge 1$, $F_r = (r-1)B_H$, where $B_H$ is the closed unit ball of $H$. Then $\{F_r ; r \ge 1\}$ is an ordered, parameterized hierarchy and $r(f) = \|f\|_H + 1$.

Proof. The first, fourth and fifth properties of an ordered, parameterized hierarchy are immediate. The second property follows from the fact that $B_H$ is convex and compact with respect to the $L_2$-norm (because it is an ellipsoid whose principal lengths decrease to zero). For the third property, fix $1 \le q < r < s$ and let $\beta = \frac{q-1}{r-1}$, $\alpha = \frac{r-1}{s-1}$. Note that $\alpha f_s^* \in F_r$ and $\beta f_r^* \in F_q$. Thus,

$$0 \le P\ell_{f_r^*} - P\ell_{f_s^*} \le P\ell_{\alpha f_s^*} - P\ell_{f_s^*} = (\alpha^2 - 1)P(f_s^*)^2 + 2(1 - \alpha)Pf_s^*Y.$$

As $s \to r$, the right-hand side tends to zero (because the candidates for $f_s^*$ are uniformly bounded in $L_2$) and so $r \mapsto P\ell_{f_r^*}$ is upper semicontinuous (the same argument works for $r = 1$). In the other direction,

$$0 \le P\ell_{f_q^*} - P\ell_{f_r^*} \le P\ell_{\beta f_r^*} - P\ell_{f_r^*} = (\beta^2 - 1)P(f_r^*)^2 + 2(1 - \beta)Pf_r^*Y$$

and the right-hand side tends to zero for the same reason as before. □

Combining Theorem 2.2 with Corollaries 3.5 and 2.7, we obtain the fol- 
lowing error bound for regularized learning in an RKHS. 

Theorem 3.7. There exist absolute constants $\kappa_1$ and $\kappa_2$, constants $c_Y$ and $c_Y'$ depending only on $\|Y\|_\infty$ and a constant $c_p$ depending only on $p$ such that the following holds. Let $K$ be a kernel satisfying Assumption 3.1 and define

$$\rho_n(r,u) = c_p r^2\left( \frac{\|(\lambda_i)\|_{p,\infty}^p}{n} \right)^{1/(1+p)} + c_Y(1 + r^2)\frac{u}{n}.$$

Then, for every $u > 0$, with probability at least $1 - \exp(-u)$, any function $\hat{f} \in F$ that minimizes the functional

$$P_n\ell_f + \kappa_1\tilde{\rho}_n(r(f), u)$$

also satisfies

$$P\ell_{\hat{f}} \le \inf_{r \ge 1}(A(r) + \kappa_2\tilde{\rho}_n(r,u)),$$

where

$$\tilde{\rho}_n(r,u) = \rho_n\left( 2r, u + \ln\frac{\pi^2}{6} + 2\ln(1 + c_Y' n + \log r) \right).$$

In particular,

$$P\ell_{\hat{f}} \le \inf_{r \ge 1}\left( A(r) + c\left( \frac{r^2}{n^{1/(1+p)}} + \frac{r^2}{n}(u + \log n + \log\log(r + e)) \right) \right),$$

where $c = c(p, \|Y\|_\infty, \|(\lambda_i)\|_{p,\infty})$.

Proof. By Theorem 2.2 and Corollary 3.5, the function $\rho_n(r,x)$ satisfies the condition of Theorem 2.5 [where we set $c_Y = c(1 + \|Y\|_\infty^2)$]. We can then apply Corollary 2.7 to obtain the result. Since $0 \in F_r$ for any $r$, we have $P\ell_{f_r^*} \le P\ell_0 = \|Y\|_{L_2}^2$ and $\rho_n(1, x + \ln(\pi^2/6)) \ge 4/n$ so that

$$\frac{\|Y\|_{L_2}^2}{\rho_n(1, x + \ln(\pi^2/6))} \le c_Y' n,$$

to which we apply Remark 2.6. □

Let us compare the estimate on the regularization term and the resulting 
error rate that follows from this theorem to previously obtained bounds on 
regularized learning in an RKHS. Since all of the results we consider have 
exponentially good confidence, we will simplify this comparison by ignoring 
the confidence term and focusing on the decay of the error bound as the 
sample size increases. In order to facilitate our comparison further, we will 
make an assumption that allows us to control the approximation error A{r). 



Assumption 3.2. Suppose that there exists $0 < \sigma \le 1$ such that $T_K^{-\sigma}\mathbb{E}(Y|X)$ belongs to $L_2$.

Recall that $T_K$ is the integral operator that defines our RKHS $H$. Note that for $\sigma = 0$, the assumption is trivial and for $\sigma \ge 1/2$, the assumption states that $\mathbb{E}(Y|X) \in H$ [and so $A(r) = 0$ for large enough $r$]. For $0 < \sigma < 1/2$, the assumption tells us the degree to which $\mathbb{E}(Y|X)$ can be approximated by functions in $H$. Indeed, a result of Smale and Zhou [30] (see also [10]) allows us to bound the approximation error in terms of $\sigma$, as follows.

Theorem 3.8 [30]. If Assumption 3.2 holds for $0 < \sigma < 1/2$, then

$$A(r) \lesssim \left( \frac{1}{r} \right)^{4\sigma/(1-2\sigma)}.$$

Our main points of comparison are the rates from [31], Corollary 5: 

^ X <7/{l + 2<7) ^ 



(3.2) Pef-mfPif<< 



n J 1 

ly" 1 

n 2 



Suppose, first, that Assumption 3.2 holds for $\sigma \ge 1/2$. As we already mentioned, this implies that the approximation error is eventually zero and so our result gives an error rate like

$$P\ell_{\hat{f}} - \inf_{f \in H} P\ell_f \lesssim \left( \frac{1}{n} \right)^{1/(1+p)},$$

which is an improvement over (3.2), even if $p = 1$. In fact, [7] shows that this rate is optimal in some sense. Interestingly, [7] also shows that one can get even better rates for $\sigma > 1/2$, that is, when the regression function not only belongs to the hypothesis class, but also satisfies some extra smoothness properties.

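For $\sigma < 1/2$, set $\kappa = 4\sigma/(1 - 2\sigma)$. We can then choose $r = n^{1/((1+p)(2+\kappa))}$ and our error bound becomes

(3.3) $$P\ell_{\hat{f}} - \inf_f P\ell_f \lesssim A(n^{1/((1+p)(2+\kappa))}) + n^{2/((1+p)(2+\kappa))}\left( \frac{1}{n} \right)^{1/(1+p)} \lesssim \left( \frac{1}{n} \right)^{2\sigma/(1+p)}.$$

Once again, this improves on (3.2), even when $p = 1$.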
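
The exponent bookkeeping behind this choice of $r$ can be verified with exact rational arithmetic. The short Python sketch below (function names are illustrative) writes $r = n^a$ and checks that the approximation term $r^{-\kappa}$ and the estimation term $r^2 n^{-1/(1+p)}$ both scale like $n^{-2\sigma/(1+p)}$; logarithmic factors and constants are ignored, as in the comparison above.

```python
from fractions import Fraction

def kappa(sigma):
    """kappa = 4 sigma / (1 - 2 sigma), valid for 0 < sigma < 1/2."""
    return 4 * sigma / (1 - 2 * sigma)

def radius_exponent(sigma, p):
    """Exponent a in r = n^a, namely a = 1 / ((1 + p)(2 + kappa))."""
    return 1 / ((1 + p) * (2 + kappa(sigma)))

def error_exponents(sigma, p):
    """Powers of n in the approximation term r^(-kappa) and the estimation term r^2 n^(-1/(1+p))."""
    a = radius_exponent(sigma, p)
    approximation = -kappa(sigma) * a
    estimation = 2 * a - 1 / (1 + p)
    return approximation, estimation

sigma, p = Fraction(1, 4), Fraction(1, 2)
approximation, estimation = error_exponents(sigma, p)
target = -2 * sigma / (1 + p)
print(approximation, estimation, target)
```

The two terms balance because $\kappa/(2+\kappa) = 2\sigma$.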

The situation $p < 1$ is more interesting because the kernels used in learning theory often have some smoothness properties. If $K \in C^\infty$, for example, then we can choose $p$ arbitrarily small and recover the following result of Wu, Ying and Zhou [36]:

$$P\ell_{\hat{f}} - \inf_f P\ell_f \lesssim \left( \frac{1}{n} \right)^{2\sigma - \varepsilon}$$

for any $\varepsilon > 0$. We will see, however, that the techniques of the next section will improve on this for $\sigma < 1/2$.
4. Toward a smaller regularization parameter. The bound (3.3) would be substantially improved if we could remove the $r^2$ term and replace it by a smaller power of $r$, which is the main source of novelty in this article. As mentioned before, the most significant source for this improvement comes from bypassing $L_\infty$-based bounds. In recent years, there has been considerable progress made on bounding various empirical processes that are indexed by sets that are either not bounded or very weakly bounded in $L_\infty$. Most of these results were motivated by questions in asymptotic geometric analysis, most notably, sampling from an isotropic, log-concave measure (e.g., [15, 25, 28]) and the approximate reconstruction problem [14, 23]. The fact that such an approach is called for here seems strange because we are dealing with a learning problem relative to a class of uniformly bounded functions, so it would seem that there is no reason to employ techniques designed to handle an unbounded situation. Even more so, because in a standard learning analysis, the way the error bounds depend on the $L_\infty$-diameter of the class is usually of no real importance. In contrast, here, the way the isomorphic results scale with the $L_\infty$-bound is extremely important because one is trying to obtain a result for the entire hierarchy and the $L_\infty$-diameter of $F_r$ is directly linked to the hierarchy parameter $r$. Thus, the standard, and very loose, approach which is commonly used in a single class situation can cause real damage in our case because the regularization term will be strongly influenced by the way that the $L_\infty$-diameter enters into the bounds.

To see where one can improve upon the standard $L_\infty$ analysis (in a very "hand-waving" way), let us return to the localized Gaussian process indexed by $\{t : \mathbb{E}\mathcal{L}_{f_t} \le x\} \cap rB_2$, conditioned on the data $(X_i, Y_i)$, that is,

$$t \mapsto \sum_{i=1}^n g_i\mathcal{L}_{f_t}(X_i, Y_i) = \sum_{i=1}^n g_i\langle t - t^*, X_i \rangle(\langle t + t^*, X_i \rangle - 2Y_i),$$

where $f_{t^*}$ minimizes the loss in $rB_2$. For every $t$, the variance of each conditioned Gaussian variable satisfies

$$\mathrm{var}\left( \sum_{i=1}^n g_i\mathcal{L}_{f_t}(X_i, Y_i) \right) = \sum_{i=1}^n \langle t - t^*, X_i \rangle^2(\langle t + t^*, X_i \rangle - 2Y_i)^2.$$

Consider some $t$ for which $\mathbb{E}\mathcal{L}_{f_t} \le x$. One can show that in this case, $\|f_t - f_{t^*}\|_{L_2}$ is at most of the order of $\sqrt{x}$ (see Lemma 4.1 below). Now, if one has a very strong concentration phenomenon and if $D = B_2$, then, for every $i$,

$$(\langle t + t^*, X_i \rangle - 2Y_i)^2 = (\langle t - t^*, X_i \rangle + 2(\langle t^*, X_i \rangle - Y_i))^2.$$

Since the expected loss of the best in the class only decreases with $r$, this term is of the order of $x$, rather than a factor that grows quadratically in $r$, which is the estimate that results from the $L_\infty$ approach. This at least hints at the fact that the $L_\infty$ approach is likely to lead to very loose estimates.

Despite the fact that the above paragraph is totally unjustified as stated and very optimistic, it turns out that this scenario is very close to the actual situation (although the proof requires a rather delicate analysis).

4.1. Further preliminaries. For technical reasons, we will make an addi- 
tional assumption on the eigenfunctions of the kernel. We should emphasize 
that it is possible that this assumption may not be necessary to obtain the 
improved regularization term, although we were not able to remove it here 
and it has a crucial role in our analysis. 

Assumption 4.1. Let $K$ be a kernel on a compact probability space $(\Omega \times \Omega, \mu \times \mu)$ with $\Omega \subset \mathbb{R}^d$. Assume that there is a constant $A$ such that the eigenfunctions of $K$ satisfy $\sup_n \|\varphi_n\|_\infty \le A < \infty$.

Let us recall from the Introduction that we still obtain a result if we assume instead that there exists $\varepsilon > 0$ with $\sup_n \lambda_n^\varepsilon\|\varphi_n\|_\infty \le A < \infty$. As discussed in the Introduction, our results hold with this weaker assumption if we modify Assumption 3.1 so that $(\lambda_n^{1-2\varepsilon})_{n=1}^\infty \in \ell_{p,\infty}$.

Recall that the feature map $\Phi$ defines an isometry from an RKHS into $\ell_2$. Let $T \subset \Phi(H)$ be a centrally symmetric, convex, compact subset of $\ell_2$. The first step in our analysis is to relate the localized sets $\mathcal{L}_x$ (corresponding to the class $\{f_t : t \in T\}$) to subsets of $T$. Since this fact appeared implicitly in several places (see, e.g., [24], Corollary 3.4) and in more general situations, for example, loss functions that are uniformly convex rather than the squared loss, we omit its proof.

Lemma 4.1. Let $t^* = \mathrm{argmin}_{t \in T}\,\mathbb{E}\ell_{f_t}$. For every $x > 0$,

$$\{t - t^* : t \in T, \mathbb{E}\mathcal{L}_{f_t} \le x\} \subset 2\sqrt{x}D \cap 2T.$$

Lemma 4.1 shows that it is sufficient to consider the complexity of the sets $\sqrt{x}D \cap T$. The complexity parameters we shall use come from a generic chaining argument (defined below) and thus a significant part of our analysis will be based on covering numbers.

Definition 4.2. Let $A, B \subset \ell_2$. Denote by $N(A, B)$ the smallest number of translates of $B$ needed to cover $A$. If $\varepsilon B$ is a ball of radius $\varepsilon$ with respect to some norm, then $N(A, \varepsilon B)$ is the minimal cardinality of an $\varepsilon$-cover of $A$ with respect to that norm. If $(A, d)$ is a metric space (rather than a normed one), we denote the cardinality of a minimal $\varepsilon$-cover of $A$ by $N(A, \varepsilon, d)$.
The generic chaining mechanism (see [34] for the most recent survey on this topic) is used to relate probabilistic properties of a random process indexed by a metric space to the metric structure of the underlying space. This mechanism originated in the study of Gaussian processes $t \mapsto X_t$, where it was proven that $\mathbb{E}\sup_{t \in T} X_t$ is equivalent to a metric invariant of $(T, d)$ for $d(s,t) = (\mathbb{E}|X_s - X_t|^2)^{1/2}$. This so-called majorizing measures theorem (in which the upper bound of the equivalence was proven by Fernique [12] and the lower by Talagrand [32]) was later developed into a more general theory with many interesting applications [34]. The metric invariant that is at the heart of this theory is the $\gamma_2$ functional, which we define as follows.

Let $(T, d)$ be a metric space. An admissible sequence of $T$ is a collection of subsets of $T$, $\{T_s : s \ge 0\}$, such that for every $s \ge 1$, $|T_s| = 2^{2^s}$ and $|T_0| = 1$.

Definition 4.3. For a metric space $(T, d)$, define

$$\gamma_2(T, d) = \inf\sup_{t \in T}\sum_{s=0}^\infty 2^{s/2} d(t, T_s),$$

where the infimum is taken with respect to all admissible sequences of $T$ and $d(t, T_s) = \inf_{u \in T_s} d(t, u)$.

Definition 4.4. A random process $t \mapsto X_t$ indexed by a metric space $(T, d)$ is sub-Gaussian relative to $d$ if, for every $s, t \in T$ and every $u \ge 1$,

$$\Pr(|X_s - X_t| \ge u\,d(s,t)) \le 2\exp(-u^2/2).$$

The generic chaining mechanism can be used to show that if $\{X_t : t \in (T, d)\}$ is sub-Gaussian, then there is an absolute constant $c$ such that for every $t_0 \in T$,

$$\mathbb{E}\sup_{t \in T}|X_t - X_{t_0}| \le c\gamma_2(T, d)$$

and similar bounds hold with high probability.

Note that one choice for sets $T_s$ that constitute a potential (yet, usually suboptimal) admissible sequence are $\varepsilon_s$-covers of $T$, where each $\varepsilon_s$ is selected in a way that ensures that $N(T, \varepsilon_s, d) \le 2^{2^s}$. An easy computation [34] then shows that

(4.1) $$\gamma_2(T, d) \le c\int_0^\infty \sqrt{\log N(T, \varepsilon, d)}\,d\varepsilon,$$

where $c$ is an absolute constant. This is a generalization of Dudley's entropy integral (see, e.g., [11, 34]), used in the study of Gaussian processes. As will be explained later, this integral bound can be improved under certain assumptions on the geometry of $T$ if $d$ is induced by a norm.
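
For a finite metric space, any admissible sequence gives an upper bound on the $\gamma_2$ functional. The Python sketch below builds one greedily by farthest-point insertion and evaluates $\sup_t \sum_s 2^{s/2} d(t, T_s)$. It is an illustration under assumptions: the greedy construction is just one convenient choice (not the optimal sequence), the sets are allowed to have cardinality at most $2^{2^s}$, and the function name is hypothetical.

```python
def chaining_bound(points, dist):
    """sup_t sum_s 2^(s/2) d(t, T_s) for a greedily built admissible sequence.

    The result is an upper bound on gamma_2(points, dist), not its exact value.
    """
    n = len(points)
    centers = [0]                                   # T_0 consists of a single point
    d_to = [dist(points[i], points[0]) for i in range(n)]
    totals = list(d_to)                             # s = 0 term: 2^0 * d(t, T_0)
    s = 1
    while max(d_to) > 0:
        size = min(n, 2 ** (2 ** s))                # |T_s| <= 2^(2^s)
        while len(centers) < size and max(d_to) > 0:
            far = max(range(n), key=lambda i: d_to[i])   # farthest-point insertion
            centers.append(far)
            d_to = [min(d_to[i], dist(points[i], points[far])) for i in range(n)]
        totals = [totals[i] + 2 ** (s / 2) * d_to[i] for i in range(n)]
        s += 1
    return max(totals)

line = [i / 15 for i in range(16)]                  # 16 equally spaced points in [0, 1]
bound = chaining_bound(line, lambda a, b: abs(a - b))
print(bound)
```

Once every point has been added to the sequence, all later terms vanish, so the loop terminates after finitely many levels.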

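The metric $d$ we will focus on here is a random one and depends on the sample $X_1, \dots, X_n \subset \ell_2$. For every $X_1, \dots, X_n$, set

$$d_{\infty,n}(f, g) = \max_{1 \le i \le n}|f(X_i) - g(X_i)|.$$

Recall that our function class $H$ is isometric to $\ell_2$ under the map $t \mapsto f_t$. Thus, $d_{\infty,n}$ defines a random norm on a projection of $\ell_2$ which is given, with some abuse of notation, by

$$d_{\infty,n}(s, t) = \max_{1 \le i \le n}|\langle X_i, s - t \rangle|.$$

Next, let $U_n(T) = (\mathbb{E}\gamma_2^2(T, d_{\infty,n}))^{1/2}$ and, for every $x > 0$, set

$$\phi_n(x) = \frac{U_n(K_x)}{\sqrt{n}} \cdot \max\left\{ \sqrt{x}, \sqrt{\mathbb{E}\ell_{t^*}}, \frac{U_n(K_x)}{\sqrt{n}} \right\},$$

where $K_x = T \cap \sqrt{x}D \subset \ell_2$ and $t^*$ is the parameter in $T$ for which $\inf_{t \in T}\mathbb{E}\ell_{f_t}$ is attained.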

C^ = {Cf:ECf<x} 

and that 

= {eCf.O < e < l,E{eCf) < x} = {h £ staiiCF,0) :Eh < x}. 

From Theorem 2.2, it is clear that in order to obtain a useful "isomorphic" 
result, one has to bound IE||P„ — P\\vx as a function of x; this is done in the 
following theorem. Since it is a modification of a result that was proven in 
[4], we will only present an outline of its proof. 

Theorem 4.5. There exists an absolute constant $c$ for which the following holds. If $T \subset \ell_2$ and $H = \{f_t : t \in T\}$, then, for every $x > 0$,

$$\mathbb{E}\|P_n - P\|_{V_x} \le c\sum_{i=0}^\infty 2^{-i}\phi_n(2^{i+1}x).$$

The proof of Theorem 4.5 relies on the following "peeling" lemma, which shows that one can control $\mathbb{E}\|P_n - P\|$ on the star-shaped hull of a class of functions if one can control $\mathbb{E}\|P_n - P\|$ on "shells" of the original class.

Lemma 4.6. For every $x > 0$,

$$\mathbb{E}\|P_n - P\|_{V_x} \le 2\sum_{i=0}^\infty 2^{-i}\mathbb{E}\|P_n - P\|_{\mathcal{L}_{2^{i+1}x}}.$$
1=0 
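Proof. Note that for every $x > 0$,

$$W_x = \{\theta\mathcal{L}_f : 0 \le \theta \le 1, \mathbb{E}(\theta\mathcal{L}_f) \le x, \mathbb{E}\mathcal{L}_f \ge x\} = \left\{ \frac{t\mathcal{L}_f}{\mathbb{E}\mathcal{L}_f} : \mathbb{E}\mathcal{L}_f \ge x, 0 \le t \le x \right\}$$

$$= \bigcup_{i=0}^\infty \left\{ \frac{t\mathcal{L}_f}{\mathbb{E}\mathcal{L}_f} : 2^i x \le \mathbb{E}\mathcal{L}_f < 2^{i+1}x, 0 \le t \le x \right\} = \bigcup_{i=0}^\infty W_{i,x}.$$

If $t\mathcal{L}_f/\mathbb{E}\mathcal{L}_f \in W_{i,x}$, then $t/\mathbb{E}\mathcal{L}_f \le 2^{-i}$ and $\mathcal{L}_f \in \mathcal{L}_{2^{i+1}x}$. Thus, $\|P_n - P\|_{W_{i,x}} \le 2^{-i}\|P_n - P\|_{\mathcal{L}_{2^{i+1}x}}$.

Finally, let $W_x^0 = \mathrm{star}(\mathcal{L}_x, 0)$. Note that $\|P_n - P\|_{W_x^0} \le \|P_n - P\|_{\mathcal{L}_x}$ and that $V_x \subset W_x \cup W_x^0$, from which our claim follows. □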

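Outline of the proof of Theorem 4.5. Fix $x > 0$. First, one can verify that the Bernoulli process indexed by $\mathcal{L}_x$, given by $t \mapsto \sum_{i=1}^n \varepsilon_i\mathcal{L}_{f_t}(X_i, Y_i)$, conditioned on $(X_i, Y_i)_{i=1}^n$, is sub-Gaussian with respect to the metric

$$d(f_{t_1}, f_{t_2}) = d_{\infty,n}(f_{t_1}, f_{t_2})\left( \sup_{v \in K}\sum_{i=1}^n \langle X_i, v \rangle^2 + \sum_{i=1}^n \ell_{t^*}(X_i, Y_i) \right)^{1/2},$$

where $K = \sqrt{x}D \cap T$. This follows from Hoeffding's inequality [which says that the process is sub-Gaussian with respect to the empirical $\ell_2$ distance between $\mathcal{L}_{f_{t_1}}$ and $\mathcal{L}_{f_{t_2}}$] and a computation to show that this distance is smaller than the above quantity. Hence, by the Giné-Zinn symmetrization method [13], followed by a generic chaining argument, we have

$$\mathbb{E}\|P_n - P\|_{\mathcal{L}_x} \le \frac{c_1}{n}\mathbb{E}\left( \gamma_2(K, d_{\infty,n})\left( \sup_{v \in K}\sum_{i=1}^n \langle v, X_i \rangle^2 + \sum_{i=1}^n \ell_{t^*}(X_i, Y_i) \right)^{1/2} \right).$$

Moreover, one can show (see, e.g., [14]) that if $H$ is a class of functions, then

$$\mathbb{E}\sup_{h \in H}\left| \sum_{i=1}^n (h^2(X_i) - \mathbb{E}h^2) \right| \le c_2\max\{\sqrt{n}\sigma_H U_n(H), U_n^2(H)\},$$

where $\sigma_H^2 = \sup_{h \in H}\mathbb{E}h^2$. In particular, for $H = \{\langle t, \cdot \rangle : t \in K\}$,

$$\mathbb{E}\sup_{t \in K}\sum_{i=1}^n \langle t, X_i \rangle^2 \le c_3\max\{nx, U_n^2(K)\}$$

because $\mathbb{E}\langle t, X \rangle^2 \le x$. Now, a straightforward computation shows that

$$\mathbb{E}\|P_n - P\|_{\mathcal{L}_x} \le c\,\phi_n(x).$$

To conclude the proof, note that by Lemma 4.6, it is possible to estimate $\mathbb{E}\|P_n - P\|_{V_x}$ using $\mathbb{E}\|P_n - P\|_{\mathcal{L}_{2^{i+1}x}}$. □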

Observe that the sets $T$ we will be interested in are $rB_2$ since they are the images of $rB_H$ in $\ell_2$. The rest of this section will be devoted to finding a bound on $\phi_n(x)$ for these sets $T$.



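4.2. Controlling $\phi_n$ for $T = rB_2$. It is clear that $\phi_n$ is determined by the structure of the sets $K_{x,r} = \sqrt{x}D \cap 2rB_2 \subset \ell_2$. To study the metric properties of these sets, we first have to identify $D$.

Consider the random variable $Z$ on $\Omega$ distributed according to $\mu$ and let $X = \Phi(Z) = \sum_{i=1}^\infty \sqrt{\lambda_i}\varphi_i(Z)e_i \in \ell_2$ be the random feature map. Clearly,

$$D = \{t \in \ell_2 : \mathbb{E}\langle t, X \rangle^2 \le 1\} = \{t \in \ell_2 : \mathbb{E}\langle t, \Phi(Z) \rangle^2 \le 1\}.$$

Since $(\varphi_i)_{i=1}^\infty$ is an orthonormal system in $L_2(\mu)$, we have

$$\mathbb{E}\langle t, X \rangle^2 = \sum_{i,j} t_i t_j\sqrt{\lambda_i\lambda_j}\,\mathbb{E}\varphi_i(Z)\varphi_j(Z) = \sum_{i=1}^\infty t_i^2\lambda_i.$$

Hence, $D$ is an ellipsoid in $\ell_2$ with the standard basis $(e_i)_{i=1}^\infty$ as principal directions, and lengths $1/\sqrt{\lambda_i}$.

It is straightforward to verify that for every $x, r > 0$, there is an ellipsoid $\mathcal{E}_{x,r}$ such that $K_{x,r} = 2rB_2 \cap \sqrt{x}D$ satisfies $\frac{1}{2}\mathcal{E}_{x,r} \subset K_{x,r} \subset \mathcal{E}_{x,r}$. The principal directions of $K_{x,r}$ and $\mathcal{E}_{x,r}$ coincide and the principal lengths of $\mathcal{E}_{x,r}$ are

$$c\min\left\{ \sqrt{x/\lambda_i}, r \right\},$$

where $c$ is an absolute constant.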
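
The identity $\mathbb{E}\langle t, X \rangle^2 = \sum_i t_i^2\lambda_i$, which identifies $D$ as an ellipsoid, can be checked on a toy example. In the Python sketch below, the domain $[0,1]$ with Lebesgue measure, the cosine eigenfunctions, the five eigenvalues and the function names are all assumptions made for illustration. The expectation is computed with a midpoint rule, which integrates these low-frequency cosine products exactly up to rounding.

```python
import math

# Hypothetical toy spectrum and basis (assumptions for illustration only).
lam = [0.4 / (i + 1) ** 2 for i in range(5)]

def phi(i, z):
    """Cosine basis, orthonormal in L_2([0, 1]) and bounded by sqrt(2)."""
    return 1.0 if i == 0 else math.sqrt(2.0) * math.cos(math.pi * i * z)

def second_moment(t, nodes=1000):
    """Midpoint-rule value of E<t, Phi(Z)>^2 for Z uniform on [0, 1]."""
    total = 0.0
    for j in range(nodes):
        z = (j + 0.5) / nodes
        inner = sum(t[i] * math.sqrt(lam[i]) * phi(i, z) for i in range(len(t)))
        total += inner ** 2
    return total / nodes

def ellipsoid_form(t):
    """sum_i t_i^2 lambda_i, the quadratic form defining D."""
    return sum(ti ** 2 * li for ti, li in zip(t, lam))

t = [1.0, -2.0, 0.5, 3.0, -1.5]
print(abs(second_moment(t) - ellipsoid_form(t)) < 1e-9)
```

Rescaling $t$ by $1/\sqrt{\sum_i t_i^2\lambda_i}$ places it on the boundary of $D$.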

The structure of the ellipsoids $\mathcal{E}_{x,r}$ indicates that it should be possible to obtain a sublinear dependency on the radius $r$ and the fact that we were not able to do so in Section 3.1 is an artifact of the suboptimal analysis that was used there. The sublinearity occurs because for $\alpha > 1$, $\mathcal{E}_{x,\alpha r}$ is much smaller than $\alpha\mathcal{E}_{x,r}$; since it is an intersection body, it only grows in some directions and the number of directions in which it grows decreases quickly with $r$.

Now that we have identified the intersection body, we are ready to estimate

$$U_n = (\mathbb{E}\gamma_2^2(\mathcal{E}_{x,r}, d_{\infty,n}))^{1/2}.$$
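Theorem 4.7. There exists an absolute constant $c$ for which the following holds. Suppose $\sup_n \|\varphi_n\|_\infty \le A$ and set

$$Q(x,r) = A\left( \sum_{i=1}^\infty \min\{x, r^2\lambda_i\} \right)^{1/2}.$$

Then

$$(\mathbb{E}\gamma_2^2(\mathcal{E}_{x,r}, d_{\infty,n}))^{1/2} \le cQ(x,r)\log n.$$

Before proving the theorem, we need two additional facts. The first is an improved "Dudley entropy integral" bound, due to Talagrand.

Theorem 4.8 [34]. There exists an absolute constant $c$ for which the following holds. If $\mathcal{E} \subset \mathbb{R}^m$ is an ellipsoid and $B$ is the unit ball of some norm $\|\cdot\|$ on $\mathbb{R}^m$, then

$$\gamma_2^2(\mathcal{E}, \|\cdot\|) \le c\int_0^\infty \varepsilon\log N(\mathcal{E}, \varepsilon B)\,d\varepsilon.$$

Another standard fact we need is the dual Sudakov inequality [26].

Lemma 4.9. There exists an absolute constant $c$ for which the following holds. Let $B_E$ be the unit ball of some norm $\|\cdot\|_E$ on $\mathbb{R}^m$ and let $B_2^m$ be the Euclidean ball on $\mathbb{R}^m$. Then, for every $\varepsilon > 0$,

$$\log N(B_2^m, \varepsilon B_E) \le c\left( \frac{\mathbb{E}\|G\|_E}{\varepsilon} \right)^2,$$

where $G = (g_1, \dots, g_m)$ is a standard Gaussian vector on $\mathbb{R}^m$.

Proof of Theorem 4.7. Fix $X_1, \dots, X_n$ and note that in order to bound $\gamma_2(\mathcal{E}_{x,r}, d_{\infty,n})$, it suffices to consider the projection of the (infinite-dimensional) ellipsoid $\mathcal{E}_{x,r}$ onto the subspace spanned by $X_1, \dots, X_n$. Hence, one can apply Lemma 4.9. Set $\|v\|_E = \max_{1 \le j \le n}|\langle v, X_j \rangle|$ and let $B_E$ be the unit ball $\{v \in \ell_2 : \|v\|_E \le 1\}$. Consider the ellipsoid $\mathcal{E}_{x,r} \subset \ell_2$ with principal directions $(e_i)_{i=1}^\infty$ and lengths $\theta_i = c_1\min\{\sqrt{x/\lambda_i}, r\}$. Let $T$ be the operator $Te_i = \theta_i e_i$ so that $TB_2 = \mathcal{E}_{x,r}$. For every $\varepsilon > 0$,

$$N(TB_2, \varepsilon B_E) = N(B_2, \varepsilon T^{-1}B_E)$$

and $v \in \varepsilon T^{-1}B_E$ if and only if $\max_{1 \le j \le n}|\langle v, T^*X_j \rangle| = \max_{1 \le j \le n}|\langle v, TX_j \rangle| \le \varepsilon$. Hence, if we set $W_i = TX_i$ and $\|v\|_{\tilde{E}} = \max_{1 \le j \le n}|\langle v, W_j \rangle|$ (with the corresponding unit ball $B_{\tilde{E}} = \{v : \|v\|_{\tilde{E}} \le 1\}$), then

$$N(TB_2, \varepsilon B_E) = N(B_2, \varepsilon B_{\tilde{E}}) = N(B_2^n, \varepsilon B_{\tilde{E}}),$$

where, here, by $B_2^n$, we mean the unit ball in the subspace of $\ell_2$ spanned by $W_1, \dots, W_n$.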
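Let $G$ be a standard Gaussian vector on $\mathbb{R}^n$. Then, by Slepian's lemma [11, 27],

$$\mathbb{E}\|G\|_{\tilde{E}} = \mathbb{E}\max_{1 \le i \le n}|\langle G, TX_i \rangle| \le c_2\sqrt{\log n}\max_{1 \le i \le n}\|TX_i\|_2.$$

Since $T$ is a diagonal operator and $X_j = \sum_{i=1}^\infty \sqrt{\lambda_i}\varphi_i(Z_j)e_i$, we have

$$\|TX_j\|_2^2 = \sum_{i=1}^\infty \theta_i^2\lambda_i\varphi_i^2(Z_j) \le A^2\sum_{i=1}^\infty \theta_i^2\lambda_i = c_1^2A^2\sum_{i=1}^\infty \min\{x, r^2\lambda_i\}.$$

Hence, setting

$$Q = Q(x,r) = A\left( \sum_{j=1}^\infty \min\{x, r^2\lambda_j\} \right)^{1/2},$$

it is evident that

(4.2) $$\mathbb{E}\|G\|_{\tilde{E}} \le c_2\sqrt{\log n}\,Q$$

and by Lemma 4.9, for every $\varepsilon > 0$,

$$\log N(B_2^n, \varepsilon B_{\tilde{E}}) \le c_3\frac{Q^2\log n}{\varepsilon^2}.$$

In particular, the diameter of $B_2^n$ with respect to the norm $\|\cdot\|_{\tilde{E}}$ is at most $cQ\sqrt{\log n}$, and we denote this diameter by $D_2$.

This estimate for the covering numbers will be used for "large" scales of $\varepsilon$. For smaller scales, we need a different argument. Applying a volumetric estimate (see, e.g., [27]) for every norm $\|\cdot\|_X$ on $\mathbb{R}^n$ and every $\varepsilon > 0$, we have $N(B_X, \varepsilon B_X) \le (5/\varepsilon)^n$. Thus, for every $0 < \varepsilon < \delta$,

$$\log N(B_2^n, \varepsilon B_{\tilde{E}}) \le \log N(B_2^n, \delta B_{\tilde{E}}) + \log N(\delta B_{\tilde{E}}, \varepsilon B_{\tilde{E}}) \le c_3\frac{Q^2\log n}{\delta^2} + n\log\left( \frac{5\delta}{\varepsilon} \right).$$

If we take $\delta^2 = c_3Q^2\frac{\log n}{n}$, then it follows that for $\varepsilon \le c_4Q\sqrt{\log n/n} = \varepsilon_0$,

$$\log N(B_2^n, \varepsilon B_{\tilde{E}}) \le n\log(\varepsilon_0/\varepsilon).$$

Now, by Theorem 4.8, for every $X_1, \dots, X_n$,

$$\gamma_2^2(\mathcal{E}_{x,r}, d_{\infty,n}) \le c_5\int_0^\infty \varepsilon\log N(TB_2, \varepsilon B_E)\,d\varepsilon = c_5\int_0^\infty \varepsilon\log N(B_2^n, \varepsilon B_{\tilde{E}})\,d\varepsilon$$

$$\le c_6\int_0^{\varepsilon_0} n\varepsilon\log\left( \frac{\varepsilon_0}{\varepsilon} \right)d\varepsilon + c_6\int_{\varepsilon_0}^{D_2}\frac{Q^2\log n}{\varepsilon}\,d\varepsilon.$$

Using the change of variables $\eta = \varepsilon/\varepsilon_0$, the first integral is bounded by $c_6n\varepsilon_0^2\int_0^1 \eta\log(\eta^{-1})\,d\eta = c_7Q^2\log n$. Noting that $\varepsilon_0 = c_8D_2n^{-1/2}$, the second integral is just

$$c_7Q^2\log n(\log D_2 - \log\varepsilon_0) = c_7Q^2\log n\left( \tfrac{1}{2}\log n - \log c_8 \right) \le c_9Q^2\log^2 n.$$ □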

We will now bound $\phi_n(x)$ using a parameter that describes the decay of the eigenvalues $(\lambda_i)$. By Assumption 3.1, the sequence of eigenvalues has a bounded weak $\ell_p$-norm for some $0 < p < 1$, implying that for all $x > 0$,

(4.3) $$|\{i : \lambda_i \ge x\}| \le \|(\lambda_i)\|_{p,\infty}^p\, x^{-p}.$$

Set $Q^2(x,r) = c_pA^2x^{1-p}r^{2p}\|(\lambda_i)\|_{p,\infty}^p$ and define the function $U_n(x,r)$ by

$$U_n(x,r) = c_pQ(x,r)\log n,$$

where $c_p$ is an appropriate constant that depends only on $p$. Then, by Lemma 3.4, $U_n(\mathcal{E}_{x,r}) \le U_n(x,r)$ and setting

$$\phi_n(x,r) = \frac{U_n(x,r)}{\sqrt{n}}\max\left\{ \sqrt{x}, \sqrt{\mathbb{E}\ell_{t^*}}, \frac{U_n(x,r)}{\sqrt{n}} \right\},$$

it follows that for $T = rB_2$, we have $\phi_n(x) \le \phi_n(x,r)$.

Lemma 4.10. Suppose that $K$ satisfies Assumptions 3.1 and 4.1. There then exists a constant $c_p$, depending only on $p$, for which the following holds. Let $T_r = rB_2$ and set $V_r$ to be the star-shaped hull of $\{\mathcal{L}_f : f \in T_r\}$. If $V_{r,x} = \{\mathcal{L}_f \in V_r : \mathbb{E}\mathcal{L}_f \le x\}$, then

$$\mathbb{E}\|P_n - P\|_{V_{r,x}} \le c_p\phi_n(x,r).$$

Proof. In view of Theorem 4.5, it is enough to show that the sum

$$\sum_{i=0}^\infty 2^{-i}\phi_n(2^{i+1}x, r)$$

is dominated by a multiple of the first term in the sum.

For any $\alpha \ge 1$ and any $x > 0$, it is evident from the definition of $U_n$ that

$$U_n(\alpha x, r) \le \alpha^{1/2-p/2}U_n(x,r);$$

therefore, one can verify that $\phi_n(\alpha x, r) \le \alpha^{1-p/2}\phi_n(x,r)$. In particular,

$$\sum_{i=0}^\infty 2^{-i}\phi_n(2^{i+1}x, r) \le 2^{1-p/2}\sum_{i=0}^\infty 2^{-ip/2}\phi_n(x,r) \le c_p\phi_n(x,r).$$ □
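
The constant produced by this geometric series can be made explicit. The Python sketch below (function names are illustrative) compares partial sums of $\sum_i 2^{-i}(2^{i+1})^{1-p/2}$ with the closed form $2^{1-p/2}/(1 - 2^{-p/2})$, which is one admissible value of $c_p$ in the last display; each term is computed from its combined exponent to avoid overflow.

```python
def shell_sum(p, terms=5000):
    """Partial sum of sum_i 2^(-i) * (2^(i+1))^(1 - p/2), computed via the combined exponent."""
    return sum(2.0 ** ((i + 1) * (1.0 - p / 2.0) - i) for i in range(terms))

def closed_form(p):
    """Limit of the series: 2^(1 - p/2) / (1 - 2^(-p/2))."""
    return 2.0 ** (1.0 - p / 2.0) / (1.0 - 2.0 ** (-p / 2.0))

for p in (0.25, 0.5, 0.9):
    print(p, shell_sum(p), closed_form(p))
```

The constant blows up as $p \to 0$, since the ratio $2^{-p/2}$ of the series then approaches 1.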

Let us pause and explain why this analysis indeed yields a far better result than the $L_\infty$ approach. We will show later that the dominant factor in $\mathbb{E}\|P_n - P\|_{V_x}$ is $U_n/\sqrt{n}$ which is, up to a logarithmic term and appropriate constants,

$$\left( \frac{1}{n}\sum_{i=1}^\infty \min\{x, r^2\lambda_i\} \right)^{1/2} = (*).$$

In comparison, the $L_\infty$ approach leads to a bound of the order of

$$r\left( \frac{1}{n}\sum_{i=1}^\infty \min\{x, r^2\lambda_i\} \right)^{1/2} = (**)$$

on $\mathbb{E}\|P_n - P\|_{V_{r,x}}$, which is considerably larger as $r$ tends to infinity.

If $x$ is a "fixed point" of (**) (as required in the "isomorphic" result of Theorem 2.2), then

$$r\left( \frac{1}{n}\sum_{i=1}^\infty \min\{x, r^2\lambda_i\} \right)^{1/2} = cx$$

and thus $x$ scales quadratically in $r$. On the other hand, the fixed point of (*) satisfies

$$\left( \frac{1}{n}\sum_{i=1}^\infty \min\{x, r^2\lambda_i\} \right)^{1/2} = cx.$$

Hence, if $(\lambda_i)$ decays quickly, then the fixed point will scale like a smaller power of $r$; in the worst case, linearly in $r$.
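
The different scaling of the two fixed points can be observed numerically. The following Python sketch is illustrative only: it assumes a finite spectrum $\lambda_i = i^{-2}$ ($p = 1/2$, 2000 terms), sets the constant $c$ to 1, fixes $n = 10^4$, and solves both fixed-point equations by bisection on a logarithmic scale. The function names are hypothetical.

```python
import math

def truncated_sum(x, r, lams):
    """sum_i min{x, r^2 lam_i} over a finite spectrum."""
    return sum(min(x, r * r * lam) for lam in lams)

def fixed_point(r, n, lams, l_infinity=False):
    """Solve g(x) = x by bisection, where g(x) = ((1/n) sum_i min{x, r^2 lam_i})^(1/2),
    multiplied by r when l_infinity is True (the L_infinity-type bound)."""
    def g(x):
        base = math.sqrt(truncated_sum(x, r, lams) / n)
        return r * base if l_infinity else base
    lo, hi = 1e-12, 1e6          # g(lo) > lo and g(hi) < hi for the values used below
    for _ in range(100):
        mid = math.sqrt(lo * hi)
        if g(mid) > mid:
            lo = mid
        else:
            hi = mid
    return math.sqrt(lo * hi)

n = 10_000
lams = [i ** -2.0 for i in range(1, 2001)]   # lambda_i = i^(-1/p) with p = 1/2
ratio_star = fixed_point(4.0, n, lams) / fixed_point(1.0, n, lams)
ratio_double_star = fixed_point(4.0, n, lams, True) / fixed_point(1.0, n, lams, True)
print(ratio_star, ratio_double_star)
```

Multiplying $r$ by 4 multiplies the fixed point of the $L_\infty$-type bound by $4^2 = 16$, while the fixed point of the sharper bound grows by a factor well below 4.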

The estimate on the fixed point in the alternative approach we presented 
in this section is the following. 



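Theorem 4.11. There exists a constant $c_{p,Y}$ depending only on $p$ and $\|Y\|_{L_2}$ such that the following holds. If Assumptions 3.1 and 4.1 are satisfied, then for every $r \ge 1$, if

$$\Theta = \frac{A\|(\lambda_i)\|_{p,\infty}^{p/2}r^p\log n}{\sqrt{n}}$$

and

$$x \ge c_{p,Y}\max\{\Theta^{2/(1+p)}, \Theta^{2/p}\},$$

then one has

$$\mathbb{E}\|P_n - P\|_{V_{r,x}} \le x/8.$$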
Proof. Fix $r \ge 1$. From the definition of $\phi_n$, it suffices to find $x$ for which $U_n(x,r)/\sqrt{n} \le c_Y\min\{x, \sqrt{x}\}$, where $c_Y \le c_1\min\{1, (\mathbb{E}\ell_{t^*})^{-1/2}\}$, for a suitable absolute constant $c_1$. Note that since $t = 0$ is a potential minimizer, $\mathbb{E}\ell_{t^*} \le \mathbb{E}Y^2$, and thus one may take $c_Y^{-1} \le c_1^{-1}(1 + (\mathbb{E}Y^2)^{1/2})$.

The definition of $\Theta$ ensures that $U_n(x,r)/\sqrt{n} = c_px^{1/2-p/2}\Theta$. To have $U_n(x,r)/\sqrt{n} \le cx$, therefore, it is enough to have $x \ge (c_{p,Y}\Theta)^{2/(1+p)}$. Similarly, to have $U_n(x,r)/\sqrt{n} \le cx^{1/2}$, it is enough that $x \ge (c_{p,Y}\Theta)^{2/p}$. □
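
The two threshold conditions in this proof can be checked directly. In the Python sketch below all unspecified constants are set to 1 (an assumption, since the text only asserts their existence), and the values of $A$, $\|(\lambda_i)\|_{p,\infty}$, $r$ and $n$ are arbitrary illustrations. It verifies that $x = \max\{\Theta^{2/(1+p)}, \Theta^{2/p}\}$ makes $x^{(1-p)/2}\Theta$ at most $\min\{x, \sqrt{x}\}$.

```python
import math

def theta(A, lam_norm, r, n, p):
    """Theta = A * ||lambda||_{p,inf}^(p/2) * r^p * log(n) / sqrt(n)."""
    return A * lam_norm ** (p / 2.0) * r ** p * math.log(n) / math.sqrt(n)

def threshold(th, p):
    """max{Theta^(2/(1+p)), Theta^(2/p)}, with the constant c_{p,Y} set to 1."""
    return max(th ** (2.0 / (1.0 + p)), th ** (2.0 / p))

def scaled_un(x, th, p):
    """x^((1-p)/2) * Theta, i.e. U_n(x, r)/sqrt(n) with the constant c_p set to 1."""
    return x ** ((1.0 - p) / 2.0) * th

p = 0.5
cases = [(1.0, 100), (5.0, 10_000), (50.0, 1_000), (2000.0, 100)]   # (r, n) pairs
results = []
for r, n in cases:
    th = theta(1.0, 1.0, r, n, p)
    x = threshold(th, p)
    results.append(scaled_un(x, th, p) <= min(x, math.sqrt(x)) * (1.0 + 1e-9))
print(all(results))
```

When $\Theta < 1$ the first term of the maximum is the binding one, and when $\Theta > 1$ the second is; in both regimes the inequality holds with equality at the threshold.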

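Corollary 4.12. Suppose that Assumptions 3.1 and 4.1 hold. There then exists a constant $c_{p,Y,A,\lambda}$ depending on $p$, $\|Y\|_\infty$, $\|(\lambda_i)\|_{p,\infty}$ and $A$ such that the function $\rho_n$ defined by

$$\rho_n(r,u) = c_{p,Y,A,\lambda}(1 + u)\max\left\{ \left( \frac{r^{2p}\log^2 n}{n} \right)^{1/(1+p)}, \frac{r^2}{n} \right\}$$

satisfies the hypothesis of Theorem 2.5.

In particular, for every $u > 0$, with probability at least $1 - \exp(-u)$, any function $\hat{f} \in F$ that minimizes the functional

$$P_n\ell_f + \kappa_1\tilde{\rho}_n(r(f), u)$$

also satisfies

$$P\ell_{\hat{f}} \le \inf_{r \ge 1}(A(r) + \kappa_2\tilde{\rho}_n(r,u)),$$

where

$$\tilde{\rho}_n(r,u) = \rho_n\left( 2r, u + \ln\frac{\pi^2}{6} + 2\ln(1 + c_Yn + \log r) \right).$$

Proof. The corollary follows directly from Theorems 4.11 and 2.2. We are able to remove the $\Theta^{2/p}$ term from Theorem 4.11 because

$$\Theta^{2/p} = A^{2/p}\|(\lambda_i)\|_{p,\infty}\frac{r^2\log^{2/p}n}{n^{1/p}} \le c_{p,A,\lambda}\frac{r^2}{n}.$$ □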

The feature of this new bound that makes it better than our previous one is the fact that the term with the worst asymptotic behavior in $n$ has the best asymptotic behavior in $r$. Indeed, the $r^2$ term in $\rho_n(r,u)$ has a dependence on $n$ that scales like $1/n$, a much better rate than in the previous section. The significance of this is the suggestion that a regularization term of $\|f\|_H^2$ will result in over-regularization when $n$ is large. In fact, a study similar to the one at the end of Section 3 shows that Corollary 4.12 is indeed far better (we delay the details of this comparison until after Corollary 5.5, in which we improve the bound even further). In the following section, we will show that one can improve Corollary 4.12 even further by completely removing the $r^2$ term.
5. Removing the $r^2$ term. The function $\rho_n$ from Corollary 4.12 is almost the function we would have liked to have. Its leading term is $\Theta^{2/(1+p)} \sim (r^{2p}n^{-1}\log^2 n)^{1/(1+p)}$, while the other term scales like $r^2/n$ and is dominant only for very large values of $r$. Here, we will show that the latter does not influence the minimization problem we are interested in and can be removed. Since some of the technical details of the proof of that observation are rather tedious and have already been presented in previous sections, certain parts of the argument will only be outlined.

Let us return to Theorem 2.2. The isomorphic condition we have established there holds in the set $F = rB_H$ with the functional

$$\psi(f,u) = c_{p,Y}\max\{\Theta^{2/(1+p)}, \Theta^{2/p}\} + c_Y(1 + r^2)\frac{u}{n}.$$

That is, for every $u > 0$, with probability at least $1 - \exp(-u)$, for every $f \in rB_H$,

$$\frac{1}{2}P_n\mathcal{L}_f - \psi(f,u) \le P\mathcal{L}_f \le 2P_n\mathcal{L}_f + \psi(f,u).$$

Consider the minimization problem one faces when performing regularized
learning. The problem is always to minimize a functional
$\hat\Lambda(f) = P_n\ell_f + \kappa_1 V_n(f,u)$,
hoping that the minimizer $\hat f$ will satisfy

$P\ell_{\hat f} \le \inf_{f \in H} \Lambda(f) = \inf_{f \in H} (P\ell_f + \kappa_2 V_n(f,u))$,

where the functional $V_n : H \times \mathbb{R}_+ \to \mathbb{R}_+$ is nonnegative. In addition, all of
the functionals we are interested in have the property that, for a fixed $f \in H$
and $u \in \mathbb{R}_+$, $V_n(f,u)$ tends to zero as $n \to \infty$.

We will specify our choice for the functional $V_n$ later, but, as a starting
point, observe that since $f = 0$ is a potential minimizer, it follows that
(assuming $\|Y\|_\infty \le 1$) any minimizer $\hat f$ of $\hat\Lambda$ will satisfy
$\hat\Lambda(\hat f) \le \hat\Lambda(0) \le 1 + V_n(0)$,
and the same will hold for $\Lambda$. Since $V_n(0)$ tends to zero as $n$ grows, we
can take $n$ sufficiently large (depending on $\|Y\|_\infty$) to ensure that $V_n(0) \le 1$.
Therefore, for these values of $n$, any minimizer $\hat f$ of $\hat\Lambda$ satisfies

$\hat\Lambda(\hat f) \le 2$

and any minimizer $f^*$ of $\Lambda$ satisfies

$\Lambda(f^*) \le 2$.

Thus,

$\{f : f \text{ minimizes } \Lambda\} \subset \{f : \mathbb{E}(f - Y)^2 \le 2\} \subset \{f : \mathbb{E}f^2 \le 9\}$

and

$\{f : f \text{ minimizes } \hat\Lambda\} \subset \{f : \hat\Lambda(f) \le 2\} \subset \{f : P_n f^2 \le 9\}$.
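
The last inclusion in the first chain can be checked directly. The following is a minimal sketch, assuming (as in the text) that $\|Y\|_\infty \le 1$ and that the penalty is nonnegative, so that a bound of 2 on the functional implies $\mathbb{E}(f-Y)^2 \le 2$. The symbol $\|\cdot\|_{L_2}$ for the $L_2$ norm is notation introduced only for this illustration.

```latex
\begin{align*}
\|f\|_{L_2} &\le \|f - Y\|_{L_2} + \|Y\|_{L_2} \le \sqrt{2} + 1, \\
\mathbb{E} f^2 &= \|f\|_{L_2}^2 \le (\sqrt{2} + 1)^2 = 3 + 2\sqrt{2} < 9 .
\end{align*}
```

The same computation with the empirical measure in place of the true one gives the last inclusion in the second chain.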






Having this in mind, we will decompose $H$ into two subsets. The first, $H_1$,
will contain $\{f : \mathbb{E}f^2 \le 9\}$. In addition, we will show that $F_r = H_1 \cap (r-1)B_H$
is an ordered, parameterized hierarchy of $H_1$ and that the assumptions of
Theorem 2.5 will be satisfied with respect to a functional $V(r,x)$ for which
the dominant term is $\Theta^{2/(1+p)}$.

Thus, by Theorem 2.5, with high probability, any minimizer $\hat f$ of $\hat\Lambda$ in $H_1$
will satisfy

(5.1)  $P\ell_{\hat f} \le \inf_{f \in H_1} (P\ell_f + \kappa_2 V(\|f\|_H, u))$,

where $V$ is defined in a similar way to $\rho_n$ in Corollary 4.12.

The next step will be to extend the result beyond $H_1$ to $H$. Indeed,
since $\{f : \mathbb{E}f^2 \le 9\} \subset H_1$, the infimum in $H$ of the right-hand side of (5.1)
is actually attained in $H_1$. Hence, the infimum in (5.1) is really over all
functions in $H$. To conclude this line of reasoning, we will then show that
with high probability, every empirical minimizer of $\hat\Lambda$ is in $H_1$, by proving
that if $f \in H \setminus H_1$, then $P_n f^2 > 9$.

The correct decomposition of H is attained using the following estimate 
for the ratio between the \\f\\H and ||/||oo for any function in H. 

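Lemma 5.1. Suppose that Assumptions 3.1 and 4.1 are satisfied. There
is then a constant $\kappa_3 = \kappa_3(A, p, \|(\lambda_i)\|_{p,\infty})$ such that, for every $f \in H$,

$\mathbb{E}f^2 \ge \kappa_3 \left( \|f\|_\infty / \|f\|_H^{p} \right)^{2/(1-p)}$.
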

Proof. Recall that \\K{x, x) ||oo < 1 and let r > 0. Set /(x) = Y^°l^ tiy/Xl^ 
ipi{x), where ||t||2 = r, and observe that ||/||oo < H-f^^Hoo?^ < Also, since 
||(Ai)||p,oo < CO and is nonnegative and nonincreasing, it follows that 

for every*. A, < (||(Ai)||p,ooA)^/^. 

Fix N (to be specified later) and observe that 

(N / oo \ l/2\ 

J^ltilTAl + r^A, 
i=l \N+1 J J 

^ / 1 \ (1-P)/2P\ 



<^(^|]|t.|VA; + r||(A.)||^S(^) 

TV 

<A^\ti\^/x'i + 

i=l 



provided that $N^{(1-p)/2p} \ge 2Ar\|(\lambda_i)\|_{p,\infty}^{1/2p} / \|f\|_\infty$. Hence,
$A\sum_{i=1}^{N} |t_i|\sqrt{\lambda_i} \ge \|f\|_\infty/2$. Note that $r/\|f\|_\infty$ is bounded below by 1 and so we can choose an
integer $N$ such that

2Ar\m\\l&' < ^{i-p)/2p < cAr\\{X,)\\l{ l^ 



oo \\J lloo 



for some constant c depending on p and ||(Ai)||p^oo- Clearly, for any v E M^, 

||f||^jv > ||f ||£jv/\/]V and thus, 

where c\ is a constant depending on A, p and ||(Aj)||p^oo- 

On the other hand, since is an orthonormal family, we have 

i 4 o—^ \ / 



ij i=l 

Let 



□ 



Since the set of minimizers of any functional $\Lambda$ we will be interested in is
contained in $\{f : \mathbb{E}f^2 \le 9\}$, it follows, by Lemma 5.1, that the set of such
minimizers is contained in $H_1$.

The set $H_1$ has additional properties. There is a constant $c$, depending
on $p$ and $\kappa_3$, such that on $H_1$,

(5.2)  $\|f\|_\infty \le c\|f\|_H^{p}$.

Moreover, for every $r \ge 1$, if one considers $F_r = H_1 \cap (r-1)B_H$, then the
minimizer of $P\ell_f$ in $\tilde F_r = (r-1)B_H$ actually belongs to $F_r$ (again, by comparing
to $f = 0$). Therefore, it is straightforward to show that $F_r$ is an ordered,
parameterized hierarchy of $H_1$ with $r(f) = \|f\|_H + 1$, implying that one can
obtain the desired isomorphic result on $H_1$, with the $r^2/n$ term replaced
by $\|f\|_\infty^2/n$.

Indeed, we can combine Theorem 2.2 with (5.2) and the fact that the
localized averages $\mathbb{E}\|P_n - P\|$ indexed by $\{\mathrm{star}(\mathcal{L}_{F_r}, 0) : \mathbb{E}h \le x\}$ are smaller
than the localized averages indexed by the larger set $\{\mathrm{star}(\mathcal{L}_{\tilde F_r}, 0) : \mathbb{E}h \le x\}$
to show that for every $r \ge 1$, with probability at least $1 - \exp(-u)$, for every
$f \in F_r$,

$\frac{1}{2} P_n\mathcal{L}_{r,f} - \frac{x}{2} - c(1 + r^{2p})\frac{u}{n} \le P\mathcal{L}_{r,f} \le 2P_n\mathcal{L}_{r,f} + \frac{x}{2} + c(1 + r^{2p})\frac{u}{n}$,

where $\mathcal{L}_{r,f}$ is the excess loss associated with $f$ relative to $F_r$.
Using Theorem 4.11, one obtains the following result.

Corollary 5.2. There exists a constant $\kappa_4$ that depends on $p$, $A$, $\|(\lambda_i)\|_{p,\infty}$
and $\|Y\|_\infty$ for which the following holds. If $T = r^p/\sqrt{n}$, then the function

$V'(r,u) = \kappa_4(1 + u)\max\{(T\log n)^{2/(1+p)}, (T\log n)^{2/p}, T^2\}$

satisfies the hypothesis of Theorem 2.5 for the hierarchy $\{F_r : r \ge 1\}$.

In particular, if we set $\hat\Lambda'(f,u) = P_n\ell_f + \kappa_1 V'(f,u)$, then, with probability
at least $1 - \exp(-u)$, every $\hat f$ that minimizes $\hat\Lambda'$ in $H_1$ also satisfies

$P\ell_{\hat f} \le \inf_{f \in H_1} (P\ell_f + \kappa_2 V'(r(f),u))$,

where $V'$ is defined analogously to $\rho_n$ in Corollary 4.12.

Next, we will show that the $(T\log n)^{2/p}$ and $T^2$ terms are nonessential.
Indeed, for sufficiently large $n$, the minimal value in $H_1$ of $\hat\Lambda'$ will be at most
2 (by comparing it to $f = 0$). Hence, if $f \in H_1$ satisfies $\kappa_5\kappa_1 T\log n > 2$ [i.e.,
if $\|f\|_H \ge \kappa_5(n/\log^2 n)^{1/2p}$], then it is not a potential minimizer of $\hat\Lambda'$ in $H_1$.
(Note that we can, by increasing $\kappa_4$, take $\kappa_5$ as small as we like; this will be
used later.) Therefore, on the set of potential minimizers, $T\log n \le c$, where
$c$ depends on $\kappa_1$, $\kappa_4$ and $p$. Hence, on this set of minimizers, we can bound

$V'(r,u) \le \kappa_4(1 + u)(T\log n)^{2/(1+p)}$.

Denoting the right-hand side by $V(r,u)$, we can invoke Remark 2.6 to show
that $V(r,u)$ is a valid functional.
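
To see why the two extra terms can be absorbed into the leading one, write $x = T\log n$ and suppose $0 \le x \le c$ on the set of potential minimizers. The sketch below additionally assumes $n \ge 3$, so that $\log n \ge 1$ and hence $T \le x$; the constants are illustrative and not optimized.

```latex
\begin{align*}
x^{2/p} &= x^{2/(1+p)} \, x^{2/(p(1+p))} \le c^{2/(p(1+p))} \, x^{2/(1+p)}, \\
T^2 \le x^2 &= x^{2/(1+p)} \, x^{2p/(1+p)} \le c^{2p/(1+p)} \, x^{2/(1+p)} .
\end{align*}
```

Hence the maximum of the three terms is at most a constant (depending on $c$ and $p$) times $(T\log n)^{2/(1+p)}$, which is the stated bound after enlarging the leading constant.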

Note that we can increase $H_1$ by adding every function $f \in H$ for which
$\|f\|_H \ge \kappa_5(n/\log^2 n)^{1/2p}$; we have already argued that such functions cannot
minimize $\hat\Lambda$.

To conclude, if

$H_1' = H_1 \cup \{f : \|f\|_H \ge \kappa_5(n/\log^2 n)^{1/2p}\}$,

then, with probability at least $1 - \exp(-u)$, every $\hat f$ that minimizes

$P_n\ell_f + \kappa_1 V(r(f),u)$

in $H_1'$ also satisfies

$P\ell_{\hat f} \le \inf_{f \in H_1'} (P\ell_f + \kappa_2 V(r(f),u))$.

Next, let us consider the set $H_2 = H \setminus H_1'$. Clearly, each function in $H_2$
satisfies $\|f\|_H < c_1\|f\|_\infty^{1/p}$ and $\mathbb{E}f^2 > 50$. We will show that, with high
probability, any $f \in H_2$ satisfies $P_n f^2 > 9$ and thus is not a potential minimizer
of $\hat\Lambda$ in $H$.

Lemma 5.3. There exists a constant $\kappa_6$ that depends on $A$, $p$, and
$\|(\lambda_i)\|_{p,\infty}$ and an absolute constant $\kappa_7$ for which the following holds. If $0 \in F$
and $F \subset \kappa_6(n/\log^2 n)^{1/2p} B_H$, then, for every $u > 0$, with probability at least
$1 - \exp(-u)$, for every $f \in F$,

$P_n f^2 \ge \frac{1}{2}\mathbb{E}f^2 - 1 - \kappa_7 \frac{u(1 + \|F\|_\infty^2)}{n}$,

where $\|F\|_\infty = \sup_{f \in F} \|f\|_\infty$.

Proof. Apply Theorem 4.11 with $Y = 0$, noting that, in this case, $\ell_f = f^2$.
It follows that we can set

$W_{x,r} = \{f^2 : \|f\|_H \le r, \ \mathbb{E}f^2 \le x\}$

and $\mathbb{E}\|P_n - P\|_{W_{x,r}} \le x/8$, provided that

$x \ge c_1\max\{(T\log n)^{2/(1+p)}, (T\log n)^{2/p}\}$,

where $c_1$ depends on $A$, $p$ and $\|(\lambda_i)\|_{p,\infty}$. We will apply this fact for $x = 2$.
That is, we need to ensure that $r$ is chosen in such a way that

$c_1\max\{(T\log n)^{2/(1+p)}, (T\log n)^{2/p}\} \le 2$,

which is the case, for example, if $r \le c_2(n/\log^2 n)^{1/2p}$.
The result now follows from Theorem 2.2. □
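
The sufficiency of the condition on $r$ in the proof above can be verified in one line. This is a sketch under the assumption that $T = r^p/\sqrt{n}$, as in Corollary 5.2, and that $n \ge 2$ so that $\log n > 0$.

```latex
\begin{align*}
T \log n = \frac{r^p \log n}{\sqrt{n}}
  \le c_2^{p} \left( \frac{n}{\log^2 n} \right)^{1/2} \frac{\log n}{\sqrt{n}}
  = c_2^{p} .
\end{align*}
```

If $c_2 \le 1$, both powers inside the maximum are then at most $c_2^{2p/(1+p)}$, so the requirement holds as soon as $c_2 \le \min\{1, (2/c_1)^{(1+p)/(2p)}\}$.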

Set $r_n = \kappa_6(n/\log^2 n)^{1/2p}$ and recall that $\kappa_5$ can be taken as small as we
like. In particular, we may assume that $\kappa_5 \le \kappa_6$ and so $H_2 \subset r_n B_H$.

The final preliminary step we take is to decompose $H_2$ into $L_\infty$-shells in
the following way. Fix $u > 0$ and set $r_0$ such that $\kappa_7 u(1 + r_0^2)/n \le 9$. Define
$(r_i)_{i=0}^{m}$ by $r_i = 2^i r_0$, where $m$ is the smallest number such that $r_m \ge r_n$.
Thus, $m \le c_1(\log n + \log u)$. Let

(5.3)  $B = \{f : \|f\|_\infty \ge \kappa_8\|f\|_H (u/n)^{(1-p)/2p}\}$,

where $\kappa_8$ is some constant that will be named in the proof of the following
lemma. We will consider the sets $F_0 = H_2 \cap r_0 B_{L_\infty}$ and

$F_i = H_2 \cap \{f : r_i \le \|f\|_\infty \le r_{i+1}\} \cap B$.
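
The logarithmic bound on the number of shells $m$ stated above can be made explicit. The sketch below rests on assumptions that are not spelled out in the text: that $r_0$ is taken as large as the constraint allows, so $r_0^2 = 9n/(\kappa_7 u) - 1$; that $u \le 9n/(2\kappa_7)$, so $r_0^2 \ge 9n/(2\kappa_7 u)$; that $n \ge 3$, so $r_n \le \kappa_6 n^{1/2p}$; and that $r_n > r_0$ (otherwise $m = 0$).

```latex
\begin{align*}
2^{m-1} r_0 < r_n
  &\;\Longrightarrow\; m < 1 + \log_2 \frac{r_n}{r_0}, \\
\frac{r_n}{r_0}
  &\le \kappa_6 \, n^{1/2p} \left( \frac{2\kappa_7 u}{9n} \right)^{1/2}
   = \frac{\kappa_6 \sqrt{2\kappa_7}}{3} \, n^{(1-p)/2p} \, u^{1/2}, \\
m &< 1 + \log_2 \frac{\kappa_6 \sqrt{2\kappa_7}}{3}
     + \frac{1-p}{2p} \log_2 n + \frac{1}{2} \log_2 u .
\end{align*}
```

Under these assumptions the count is of the form $c(\log n + \log u)$, with $c$ depending only on $p$, $\kappa_6$ and $\kappa_7$.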

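Since $\bigcup_{i=0}^{m} (H_2 \cap \{f : r_i \le \|f\|_\infty \le r_{i+1}\}) = H_2$, any $f \in H_2 \setminus \bigcup_{i=0}^{m} F_i$ satisfies

$\|f\|_\infty < \kappa_8\|f\|_H (u/n)^{(1-p)/2p}$,

and because $\|f\|_H \le r_n$, we have

$\|f\|_\infty \le \kappa_6\kappa_8 (n/\log^2 n)^{1/2p} (u/n)^{(1-p)/2p} = \kappa_6\kappa_8 \, n^{1/2} u^{(1-p)/2p} / \log^{1/p} n$.

Therefore,

$\frac{1}{n} \left\| H_2 \setminus \bigcup_{i=0}^{m} F_i \right\|_\infty^2 \le c \, u^{(1-p)/p} / \log^{2/p} n$.
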

Lemma 5.4. There exist constants $c_1$ and $c_2$, depending only on $A$, $p$ and
$\|(\lambda_i)\|_{p,\infty}$, for which the following holds. Fix $n$ and $0 < u \le c_1 n$, and perform
the above decomposition. For every $0 \le i \le m$, with probability at least
$1 - \exp(-u)$, every $f \in F_i$ satisfies $P_n f^2 > 9$. Also, if $u \le c_2(\log n)^{2/(1-p)}$, then,
with probability $1 - \exp(-u)$, for every $f \in H_2 \setminus \bigcup_{i=0}^{m} F_i$, $P_n f^2 > 9$.

Proof. First, fix $1 \le i \le m$ and apply Lemma 5.3 to the set $F_i$. For
every $f \in F_i$, $\|f\|_\infty \le \|F_i\|_\infty \le 2\|f\|_\infty$, and thus, with probability at least
$1 - \exp(-u)$,

P„f > 1e/= - 1 - ,,ii(l±Mi£) > l^f _ 1 _ 2./" + 11/11^). 

2 n 2 n 

On the other hand, for every f £ B, 

I?., 

provided that Kg > {Skj / k^)'^^^'p^/'^^ . Therefore, with probability at least 
1 — exp(— u), for every / S Fi, 

Pnf^ > 7IE/' - 1 > 10 ^ > 9 

4 n ci 

for a suitably large choice of ci. 

Turning to Fq, since Ky "^"^"*"^^^"^^""-^ < 9, we have, by Lemma 5.3, with 
probability at least 1 — exp(— u), for every f £ Fq, 

2 n 

Finally, since n~^\\H2 \ Ui^o-^*lloo — '^ "ogZ/p^^ ' follows that for our choice 
of u. 



00 ^ 



n 

from which our claim follows, using the same argument as for Fq. □ 



We can now prove our main result, which is the second part of the
following claim and was formulated as Theorem A in the Introduction.

Corollary 5.5. If Assumptions 3.1 and 4.1 are satisfied, then there
exist constants $c_1$, $c_2$ and $c_3$ that depend only on $A$, $p$ and $\|(\lambda_i)\|_{p,\infty}$, a
constant $N_0$ that depends on $\|Y\|_\infty$ and $p$ and a constant $c_Y$ that depends
only on $\|Y\|_\infty$, for which the following holds.

If $n \ge N_0$ and $c_1\log\log n \le u \le c_2(\log n)^{2/(1-p)}$, then, with probability at least
$1 - \exp(-u/2)$, for every $f \in H_2$, $P_n f^2 > 9$. Thus, all of the minimizers in
$H$ of

(5.4)  $P_n\ell_f + \kappa_1 V(f,u)$

belong to $H_1$. In particular, for such values of $u$, with probability at least
$1 - 2\exp(-u/2)$, every minimizer $\hat f$ in $H$ of (5.4) satisfies

$P\ell_{\hat f} \le \inf_{f \in H} (P\ell_f + \kappa_2 V(f,u))$,

where

V{f, u) = C3(l + u + CYlnn + lnlog(||/||^, + e)) iy ^^^^ ) 

Let us (briefly) repeat the analysis that we carried out at the end of
Section 3. Recall Assumption 3.2: we assume the existence of $0 < \sigma < 1$ such
that the regression function $\mathbb{E}(Y|X)$ belongs to $T^{\sigma}L_2$. Recall, also, that
under this assumption, the approximation error $A(r)$ behaves like $r^{-4\sigma/(1-2\sigma)}$.
Under Corollary 5.5, the error of the empirical minimizer is like (ignoring
logarithmic terms)

$\inf_r \left( A(r) + \frac{r^{2p/(1+p)}}{n^{1/(1+p)}} \right) \le \inf_r \left( r^{-4\sigma/(1-2\sigma)} + \frac{r^{2p/(1+p)}}{n^{1/(1+p)}} \right)$,

which can be optimized by choosing $r = n^{(1-2\sigma)/(2p+4\sigma)}$. This gives us a final
error rate of

$\left( \frac{1}{n} \right)^{2\sigma/(p+2\sigma)}$,

which is, as promised, better by a polynomial factor than the previous error
rate of $n^{-2\sigma/(1+p)}$ whenever $\sigma < 1/2$.
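
The optimization over $r$ can be carried out by balancing the two terms. The following sketch assumes the approximation error is exactly $r^{-a}$ with $a = 4\sigma/(1-2\sigma)$ and $\sigma < 1/2$, and ignores constants and logarithmic factors; the shorthand $a$, $b$, $c$ is introduced only for this illustration.

```latex
\begin{align*}
a &= \frac{4\sigma}{1-2\sigma}, \qquad
b = \frac{2p}{1+p}, \qquad
c = \frac{1}{1+p}, \\
r^{-a} = \frac{r^{b}}{n^{c}}
  &\;\Longleftrightarrow\; r = n^{c/(a+b)}, \qquad
a + b = \frac{4\sigma + 2p}{(1-2\sigma)(1+p)}, \\
\frac{c}{a+b} &= \frac{1-2\sigma}{2p+4\sigma}, \qquad
\frac{ac}{a+b} = \frac{4\sigma}{4\sigma+2p} = \frac{2\sigma}{p+2\sigma}, \\
r^{-a} &= n^{-ac/(a+b)} = n^{-2\sigma/(p+2\sigma)} .
\end{align*}
```

An exponent of the form $2\sigma/(p+2\sigma)$ exceeds one of the form $2\sigma/(1+p)$ exactly when $p + 2\sigma < 1 + p$, that is, when $\sigma < 1/2$, which is the regime in which the improvement is claimed.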

APPENDIX: PROOFS 

The starting point in the proof of Theorem 2.5 is the following theorem
by Bartlett [1].

Theorem A.1. Suppose that $\{F_r : r \ge 1\}$ is an ordered, parameterized
hierarchy and that $\rho_n(r)$ is a positive, continuous, increasing function. If,
for all $r \ge 1$ and all $f \in F_r$,

(A.1)  $\frac{1}{2} P_n\mathcal{L}_{r,f} - \rho_n(r) \le P\mathcal{L}_{r,f} \le 2P_n\mathcal{L}_{r,f} + \rho_n(r)$,

then

where $\hat f$ is any function that minimizes the functional $P_n\ell_f + c_2\rho_n(r(f))$.

Proof. Let $(r_i)_{i=1}^{\infty}$ be an increasing sequence (to be determined later)
such that $r_1 = 1$ and $r_i \to \infty$ as $i \to \infty$. Define, for each $i \ge 1$,
$u_i = u + \ln(\pi^2/6) + 2\ln i$. Then

$\sum_{i=1}^{\infty} e^{-u_i} = e^{-u}$

and so, by the union bound, with probability at least $1 - e^{-u}$, for every $i \ge 1$,

$\frac{1}{2} P_n\mathcal{L}_{r_i,f} - \rho_n(r_i,u_i) \le P\mathcal{L}_{r_i,f} \le 2P_n\mathcal{L}_{r_i,f} + \rho_n(r_i,u_i)$.
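
The particular choice of $u_i$ is what makes the union bound close exactly. As a short check, assuming each event indexed by $i$ fails with probability at most $e^{-u_i}$ and using the classical identity $\sum_{i \ge 1} i^{-2} = \pi^2/6$:

```latex
\begin{align*}
\sum_{i=1}^{\infty} e^{-u_i}
  = e^{-u} \cdot \frac{6}{\pi^2} \sum_{i=1}^{\infty} \frac{1}{i^2}
  = e^{-u} \cdot \frac{6}{\pi^2} \cdot \frac{\pi^2}{6}
  = e^{-u} .
\end{align*}
```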

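If we only cared about a sequence of $r_i$, this would be enough for our
result. However, we need an almost-isomorphic condition for all $r \ge 1$ and
so the next step must be to find an almost-isomorphic condition for when
$r \in [r_{j-1}, r_j]$. In one direction, we have

$$\begin{aligned}
P\mathcal{L}_{r,f} &= P\mathcal{L}_{r_j,f} - P\mathcal{L}_{r_j,f_r^*} \\
&\le 2P_n\mathcal{L}_{r_j,f} + \rho_n(r_j,u_j) - P\mathcal{L}_{r_j,f_r^*} \\
&= 2P_n\mathcal{L}_{r,f} + 2P_n\mathcal{L}_{r_j,f_r^*} + \rho_n(r_j,u_j) - P\mathcal{L}_{r_j,f_r^*} \\
&\le 2P_n\mathcal{L}_{r,f} + 5\rho_n(r_j,u_j) + 3P\mathcal{L}_{r_j,f_r^*} \\
&\le 2P_n\mathcal{L}_{r,f} + 5\rho_n(r_j,u_j) + 3P\mathcal{L}_{r_j,f^*_{r_{j-1}}},
\end{aligned} \tag{A.2}$$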

while in the other direction, we get 

2PCrJ = 2-P-Crj,/ — '^P^TjJ* 

> PnCr^J - 2pn{rj,Uj) - 2PCr^Jf 

(A.3) = PnCrJ + PnCr^ Jf - 2p„ {rj ,Uj) - 2PCrj J* 

> PnCrJ — ^Pn{rj,Uj) — ^PCr^J* 

> PnC-rJ — ^Pn{rj,Uj) — ^PCrjJ*^__^ ■ 

We can now choose our sequence $r_i$: recall that $r_1 = 1$ and set $r_i$, for all
$i \ge 2$, to be the largest number satisfying both of the following inequalities:

$r_i \le 2r_{i-1}$,

(A.4)  $P\mathcal{L}_{r_i, f^*_{r_{i-1}}} \le \rho_n(r_i,u_i)$.

Note that choosing the largest number is not a problem because both $\rho_n(r,u)$
and $P\mathcal{L}_{r, f^*_{r_{i-1}}}$ are continuous functions of $r$; that is, the supremum of the
set of $r$ satisfying (A.4) is attained.

Our choice of rj ensures that, for all i > 1, 

(A.5) i< — — +log2rj)< — — -+log(2ri. 

Pn{ri,Ui) Pn{ri,Ui) Pn{ri,ui) 

Indeed, for $i = 1$, this is trivial. For larger $i$, we can proceed by induction: our
definition of $r_i$ ensures that either $r_i = 2r_{i-1}$ or $P\ell(f^*_{r_{i-1}}, Y) = P\ell(f^*_{r_i}, Y) + \rho_n(r_i,u_i)$.
In the first case, $\log r_i = \log r_{i-1} + 1$ and the inductive step follows.
In the second case, assuming that

« - 1 < 7 T 7 T +logri_i, 

Pn[ri,Ui) Pn{ri-l,Ui-i) 



it follows that 



i < — r 7 r + 1 + log(2rj 

< — r 7 ^ + 1 + log(2ri) 

Pn{ri,ui) Pn{ri,Ui) 



Pe{f*^,Y) PiUn,Y) 



+ log(2ri), 



/On(n,Wl) Pn{ri,Ui) 

which proves (A.5) by induction. In particular, for any $i \ge 1$ and any $r \ge r_i$,
$u_i \le \theta(r,u)$. Therefore,

$\rho_n(r_i,u_i) \le \rho_n(2r, \theta(r,u))$

for any $r \in [r_{i-1}, r_i]$.

Note that (A.5) implies that the sequence tends to infinity with $i$. Then,
by (A.2), (A.3) and (A.4), with probability at least $1 - e^{-u}$, for all $r \ge 1$ and
all $f \in F_r$,

$\frac{1}{2} P_n\mathcal{L}_{r,f} - 4\rho_n(2r, \theta(r,u)) \le P\mathcal{L}_{r,f} \le 2P_n\mathcal{L}_{r,f} + 8\rho_n(2r, \theta(r,u))$.

We conclude the proof by applying Theorem A.1. □